# Optimizing NOTEARS Objectives via Topological Swaps

Chang Deng 1, Kevin Bello 1 2, Bryon Aragam 1, Pradeep Ravikumar 2

Recently, an intriguing class of non-convex optimization problems has emerged in the context of learning directed acyclic graphs (DAGs). These problems involve minimizing a given loss or score function, subject to a non-convex continuous constraint that penalizes the presence of cycles in a graph. In this work, we delve into the optimization challenges associated with this class of non-convex programs. To address these challenges, we propose a bi-level algorithm that leverages the non-convex constraint in a novel way. The outer level of the algorithm optimizes over topological orders by iteratively swapping pairs of nodes within the topological order of a DAG. A key innovation of our approach is the development of an effective method for generating a set of candidate swapping pairs for each iteration. At the inner level, given a topological order, we utilize off-the-shelf solvers that can handle linear constraints. The key advantage of our proposed algorithm is that it is guaranteed to find a local minimum or a KKT point under weaker conditions compared to previous work, and finds solutions with lower scores. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches in terms of achieving a better score. Additionally, our method can also be used as a post-processing algorithm to significantly improve the score of other algorithms. Code implementing the proposed method is available at https://github.com/duntrain/topo.

1 Booth School of Business, University of Chicago, USA. 2 Machine Learning Department, Carnegie Mellon University, USA. Correspondence to: Chang Deng, Kevin Bello.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

1. Introduction

We study a class of constrained non-convex optimization problems defined as follows:

    min_Θ Q(Θ)  subject to  h(W(Θ)) = 0,   (1)

where Θ ∈ R^l corresponds to all model parameters, and W(Θ) ∈ R^{d×d} is a weighted adjacency matrix representing the structure of a directed graph of d nodes induced by Θ. Moreover, Q : R^l → R is a (possibly non-convex) differentiable function that we will refer to as the score or loss function, while h : R^{d×d} → [0, ∞) is a nonnegative non-convex differentiable function that penalizes cycles in the weighted adjacency matrix W(Θ), and whose level set at zero corresponds to directed acyclic graphs (DAGs).

The class of problems (1) arose in the paper by Zheng et al. (2018) in the context of learning the underlying structure of a structural equation model (SEM), typically assumed to be a DAG. In Zheng et al. (2018), the challenges of combinatorial optimization were replaced by those of differentiable non-convex optimization. While global optimality remains intractable in general, the key advantage of the class of problems (1) is that it admits the use of general-purpose non-linear optimizers. Due to the latter, several studies have built upon the work of Zheng et al. (2018), usually by either proposing a new characterization of h (e.g., Yu et al., 2019; Bello et al., 2022), or using different score functions Q (e.g., Zheng et al., 2018; 2020; Ng et al., 2020; Yu et al., 2019; Lachapelle et al., 2020). All, however, with a clear lack of optimality guarantees.
Based on these formulations, Wei et al. (2020) and Ng et al. (2022) studied some of the optimization-theoretic curiosities associated with this class of problems. Wei et al. (2020) provide local optimality guarantees assuming linear models and a convex score Q, while Ng et al. (2022) study the convergence challenges of (1). The focus of our work is studying optimality for the class of problems (1) in a more general setting, i.e., admitting a possibly non-convex score Q and nonlinear models. We pay close attention to the Karush-Kuhn-Tucker (KKT) optimality conditions, building upon similar results first studied in Wei et al. (2020). The KKT conditions are known to be a necessary first-order characterization of optimal solutions under some regularity conditions, and form the backbone of nonlinear programming (Bertsekas, 1997; Boyd et al., 2004). More specifically, we show that by an equivalent reformulation of the KKT conditions, we can find better solutions to (1), that is, KKT points and/or local minima with better (i.e., lower) score, while also relaxing the conditions required in previous work. The key idea is to relate the KKT conditions to an optimal topological sort and leverage the fact that solving the continuous program for a fixed ordering is often tractable. Although not every topological sort corresponds to a local minimum in the continuous formulation, we show that our method can indeed be rigorously interpreted as iteratively selecting better and better local minimizers until no improvement can be found. Our method also avoids explicitly enforcing the acyclicity constraint h, and instead uses the continuous characterization indirectly via the KKT conditions.

Contributions. To this end, we make the following specific contributions:

1. We propose a bi-level optimization algorithm, in which the outer level optimizes over topological orders and the inner level optimizes the score given a specific order. To optimize over orders, we use a novel technique for selecting candidate pairs of nodes to be swapped, which is described in detail in Section 4. This approach involves iteratively swapping pairs of nodes within the topological order of a DAG, and utilizes the KKT conditions as a guide for determining which pairs to consider swapping. To optimize the score given a specific order, we utilize state-of-the-art solvers that are able to solve the problems to stationary points.

2. We prove that our method searches between local minima and strictly decreases the score at each iteration (Section 4.3). We furthermore show that our method provably finds local minimizers under strictly weaker conditions compared to previous work (Lemma 4). In particular, we show that the concept of irreducibility introduced in Wei et al. (2020) is not necessary to ensure local optimality, and provide an explicit example as demonstration (Appendix C.2).

3. We conduct a comprehensive set of experiments in multiple settings to evaluate the performance of our algorithm against state-of-the-art methods for solving problem (1). The results of our experiments, summarized in Section 5, demonstrate that our method is able to find minimizers with lower scores (compared to existing algorithms) that are guaranteed to be either local minima or KKT points. An attractive feature of our method is its flexibility: it can be used both as a standalone algorithm and as a post-processing step when provided with a pre-computed DAG as an initialization.
Although the underlying optimization problem is non-convex and plagued by poor local minima, our results demonstrate that it is still possible to discover suitable local minima with improved scores. This is a noteworthy achievement given that non-convex problems of this nature are often considered challenging and difficult to optimize.

2. Related Work

Most closely related to our work are methods that build on the non-convex continuous constrained formulation of Zheng et al. (2018) (e.g., Yu et al., 2019; Zheng et al., 2020; Lachapelle et al., 2020; Ng et al., 2020; Zhu et al., 2020; Romain & d'Aspremont, 2020; Bello et al., 2022). In contrast to this previous work, our focus is on optimality conditions, i.e., ensuring that we find a DAG that satisfies the KKT optimality conditions (in fact, it will be a local minimizer) of an equivalent formulation to that of Zheng et al. (2018). Similar to our work, recent work (Wei et al., 2020; Ng et al., 2022) has begun to study the optimization-theoretic aspects of this problem. In contrast to Wei et al. (2020), which is only guaranteed to return some local minimizer, our method iteratively jumps from one local minimizer to another until a stopping criterion is met. The latter allows our method to seek out more favorable local minimizers, that is, DAGs that attain lower scores. Ng et al. (2022) studies a different question, namely the convergence of methods for solving these problems.

Although our emphasis is on optimization, it is useful to provide some context from the graphical modeling literature as well. Most algorithms for learning DAGs fall into two main categories: score-based methods that optimize a score function, and constraint-based methods that use independence tests. Since the program (1) is modeled after traditional score-based methods, we only mention a few classical constraint-based algorithms, such as the PC algorithm (Spirtes & Glymour, 1991), a general algorithm that learns the Markov equivalence class; max-min parents and children (MMPC, Tsamardinos et al., 2006); and a variety of algorithms based on local Markov boundary search, such as grow-shrink (GS, Margaritis & Thrun, 1999; Margaritis, 2003) and incremental association (IAMB, Tsamardinos et al., 2003).

Score-based methods assign a score to a candidate DAG structure based on how well it fits the observed data, and then attempt to find the highest-scoring structure. Classical score functions include the log-likelihood based BIC and AIC scores as well as Bayesian scores under different parameter priors (Geiger & Heckerman, 2002). Other related work that studies the Gaussian setting is given by Aragam & Zhou (2015) and Ghoshal & Honorio (2017; 2018), and the non-Gaussian case by Loh & Bühlmann (2014). On the side of approximate algorithms, notable methods include greedy search (Chickering, 2003), order search (Teyssier & Koller, 2005; Scanagatta et al., 2015; Park & Klabjan, 2017), and the LP-relaxation based method proposed by Jaakkola et al. (2010). There are also exact algorithms such as GOBNILP (Cussens, 2012) and bene (Silander & Myllymaki, 2006).
Another line of work (Teyssier & Koller, 2005; Xiang & Kim, 2013; Raskutti & Uhler, 2018; Drton et al., 2018; Ye et al., 2020; Squires et al., 2020; Solus et al., 2021; Wang et al., 2021) studies order-based methods, which bear a superficial relationship to our algorithm; it is worth emphasizing, however, that none of them theoretically analyze optimization properties such as KKT theory or local optimality guarantees, nor do they apply to arbitrary smooth losses. More specifically, Ye et al. (2020) is restricted to log-likelihood based scores, and Raskutti & Uhler (2018), Squires et al. (2020) and Solus et al. (2021) require faithfulness-related assumptions. Silander & Myllymaki (2006) and Xiang & Kim (2013) are exact methods that only work with a small number of nodes.

3. Notation and Background

In this section, we establish the notation and provide context for the class of problems (1).

3.1. Nonlinear DAG models

We let G = (V, E) denote a directed graph of d nodes, with vertex set V = [d] := {1, . . . , d} and edge set E ⊆ V × V, where (i, j) ∈ E indicates the presence of a directed edge from node i to node j. For a graph G, we associate each node i ∈ V to a random variable X_i, and use X = (X_1, . . . , X_d) to denote the d-dimensional random vector. We consider structural equation models (SEMs; Peters et al., 2017), in which each node X_j is determined by a function f_j : R^d × R → R of its parents and independent noise z = (z_1, . . . , z_d) ∈ R^d as follows:

    X_j = f_j(X, z_j),  with ∂_k f_j ≡ 0 if k ∉ PA^G_j,   (2)

where PA^G_j = {i ∈ V | (i, j) ∈ E} denotes the set of parents of node j in G. Note that we write f_j as a function of every other variable and separately impose a restriction on the dependence through the partial derivatives, as in Zheng et al. (2020). This is equivalent to the usual formulation X_j = f_j(PA^G_j, z_j), and is adopted for mathematical convenience in the sequel. Standard examples of SEMs include linear SEMs (e.g., Peters & Bühlmann, 2014; Loh & Bühlmann, 2014) and additive noise models (Peters et al., 2014).

With this notation, the graphical structure implied by an SCM f = (f_1, . . . , f_d) can be represented by the following d × d weighted adjacency matrix:

    W = W(f) = (w_ij),  w_ij = ‖∂_i f_j‖_2.   (3)

In practice, a family of functions is defined to approximate the nonlinear functions f_j; common examples include multilayer perceptrons (MLPs) (Zheng et al., 2020; Lachapelle et al., 2020) and basis expansions (Zheng et al., 2020; Bühlmann et al., 2014). See Appendix A for a detailed discussion on these families of functions. We use Θ to denote all the model parameters used for approximating f. However, not all of these parameters are utilized for inducing the graphical structure implied by f. To differentiate, we use θ ⊆ Θ to denote the subset of parameters that are used for inducing the weighted adjacency matrix W, and θ̄ = Θ \ θ to denote the remaining model parameters. In other words, we have the following relationship: W(f) = W(Θ) = W(θ).

To simplify notation and improve the clarity of presentation, we present the case where there is a single parameter θ_ij per candidate edge (i, j), i.e., [W(θ)]_ij = θ_ij and W(θ) = θ. (In general, θ_ij can be a vector; in that case it is required that [W(θ)]_ij = 0 if and only if θ_ij = 0, see Appendix A for more discussion.) Note that all of our results hold for the general case and are thoroughly treated in the technical proofs provided in Appendix D.
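To make the mapping from model parameters to graph structure concrete, the following is a minimal sketch (our own illustration; the array shapes and the 3-node example are assumptions, not taken from the paper) of how W(θ) can be formed for a linear SEM and for the MLP parameterization discussed in Appendix A.

```python
import numpy as np

def adjacency_linear(W):
    # Linear SEM X_j = w_j^T X + z_j: here ∂_i f_j ≡ W_ij, so [W(f)]_ij = |W_ij|.
    return np.abs(W)

def adjacency_mlp(first_layers):
    # first_layers[j] is the first-layer weight matrix A_j^(1) (shape m_1 x d) of the
    # MLP approximating f_j.  As discussed in Appendix A, X_i is a non-parent of X_j
    # iff the i-th column of A_j^(1) is zero, so we set [W(θ)]_ij = ||θ_ij||_1 with
    # θ_ij the i-th column of A_j^(1).
    d = len(first_layers)
    W = np.zeros((d, d))
    for j, A1 in enumerate(first_layers):
        W[:, j] = np.abs(A1).sum(axis=0)   # column-wise l1 norms
    return W

# Hypothetical 3-node example
rng = np.random.default_rng(0)
first_layers = [rng.normal(size=(10, 3)) for _ in range(3)]
first_layers[2][:, 0] = 0.0                # force "no edge X_1 -> X_3"
print(adjacency_mlp(first_layers))         # entry (0, 2) is exactly zero
```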
3.2. Score functions

The class of programs (1) requires a loss/score function Q. We briefly review commonly used scores in the literature. Let X = [x_1, . . . , x_d] ∈ R^{n×d} denote the observed data matrix, let Θ_i denote the parameters used to approximate f_i, and write f_{Θ_i} for the resulting approximation of f_i. Since the score function depends on the observed data, in this subsection we write Q(Θ; X) for the score of Θ given X. Some possible score functions include:

Least squares. Q(Θ; X) = (1/2n) Σ_{i=1}^d ‖x_i − f_{Θ_i}(X)‖_2^2, for linear SEMs with equal noise variances (Loh & Bühlmann, 2014).

Negative log-likelihood. Q(Θ; X) = (1/2) Σ_{i=1}^d log(‖x_i − f_{Θ_i}(X)‖_2^2), for additive SEMs with Gaussian errors (Bühlmann et al., 2014).

Logistic loss. Let 1_n denote the n-dimensional vector of ones. Then Q(Θ; X) = (1/n) Σ_{i=1}^d 1_n^T ( log(1_n + exp(f_{Θ_i}(X))) − x_i ◦ f_{Θ_i}(X) ), for generalized linear models with binary variables (Zheng et al., 2020).

In the sequel, we simplify notation by writing Q(Θ) instead of Q(Θ; X).

Remark 1. It is important to emphasize that in practical applications the choice of score Q is crucial: in order for solutions to this problem to be useful, ideally the minimizer(s) of Q should correspond to the true underlying DAG. This problem has been extensively studied (Geiger & Heckerman, 2002; Chickering, 2003; Van de Geer & Bühlmann, 2013; Loh & Bühlmann, 2014; Nandy et al., 2018; Aragam et al., 2019), so we do not pursue it further here. For example, in recent work, Reisach et al. (2021) show how certain scores are not scale invariant, which may be an issue in practice, but is simply an artifact of the score function, as originally pointed out by Loh & Bühlmann (2014). By contrast, our explicit goal is to study the optimization-theoretic aspects of objectives (1), and not to propose new algorithms for learning causal DAGs.

3.3. Continuous non-convex characterizations of DAGs

To conclude this section, we provide a brief overview of the existing options for the function h. We remind the reader that for presentation simplicity we have W(θ) = θ, as discussed at the end of Section 3.1.

Condition 1. The function h has the form

    h(B) = Σ_{i≥1} c_i Tr(B^i),  where c_i > 0 for any i.

Corollary 1 (Wei et al., 2020, Theorem 1). If h satisfies Condition 1, then h(B) = 0 if and only if B corresponds to a DAG, for any nonnegative matrix B.

By now the literature contains many different proposals of functions h that satisfy Condition 1; in this paper, we mostly focus on the following three:

1. The NOTEARS formulation. Zheng et al. (2018) were the first to propose a differentiable characterization of DAGs, given by h(B) = Tr(e^B) − d for a nonnegative matrix B.

2. A polynomial formulation. Yu et al. (2019) proposed the use of h(B) = Tr((I + B/d)^d) − d for a nonnegative matrix B.

3. The DAGMA formulation. Bello et al. (2022) proposed the use of h(B) = −log det(I − B) for a nonnegative matrix B with spectral radius less than one.

Note that B above is commonly defined as B = θ ◦ θ, where ◦ denotes the Hadamard product. In that case, it has been shown that ∇_θ h(θ ◦ θ) = 0 if and only if θ is a DAG (see Wei et al., 2020). The latter implies that all stationary points of h are global minima of h, a property known as invexity, as highlighted by Bello et al. (2022).

Remark 2. Our results are general and apply to any function h satisfying Condition 1; in particular, they apply to any of the three h functions mentioned above.
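To make the three characterizations above concrete, here is a minimal NumPy/SciPy sketch (the 3-node matrix W is a made-up example) that evaluates each h and illustrates that it vanishes exactly when the weighted adjacency matrix corresponds to a DAG.

```python
import numpy as np
from scipy.linalg import expm

def h_notears(B):
    # NOTEARS (Zheng et al., 2018): h(B) = tr(e^B) - d
    return np.trace(expm(B)) - B.shape[0]

def h_poly(B):
    # Polynomial (Yu et al., 2019): h(B) = tr((I + B/d)^d) - d
    d = B.shape[0]
    return np.trace(np.linalg.matrix_power(np.eye(d) + B / d, d)) - d

def h_dagma(B):
    # DAGMA (Bello et al., 2022): h(B) = -log det(I - B), for spectral radius of B below one
    _, logabsdet = np.linalg.slogdet(np.eye(B.shape[0]) - B)
    return -logabsdet

W = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 0.5],
              [0.0, 0.0, 0.0]])               # DAG: 1 -> 2 -> 3
B = W * W                                     # B = θ ∘ θ (Hadamard product)
print([float(h(B)) for h in (h_notears, h_poly, h_dagma)])   # all ≈ 0

W_cyclic = W.copy()
W_cyclic[2, 0] = 0.3                          # add 3 -> 1, closing a cycle
print(float(h_notears(W_cyclic * W_cyclic)))  # strictly positive
```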
3.4. Necessary and sufficient conditions for optimality

Wei et al. (2020) first studied (1) from an optimality perspective. The authors argued that the use of the Hadamard product θ ◦ θ leads to an undesirable property, namely, that no feasible Θ in (1) can satisfy regularity conditions. Motivated by this negative result, Wei et al. (2020) proposed an alternative, yet equivalent, formulation that replaces h(θ ◦ θ) by h(|θ|). Reasoning similarly, we reformulate (1) as

    min_Θ Q(Θ)  subject to  h(|θ|) = 0.   (4)

Writing θ = θ+ − θ−, where θ+ = max{θ, 0} and θ− = max{−θ, 0} denote the positive and negative parts of θ, respectively, an equivalent smooth formulation is given by

    min_{θ+, θ−, θ̄} Q((θ+ − θ−, θ̄))   (5)
    subject to  h(θ+ + θ−) = 0,  θ+ ≥ 0,  θ− ≥ 0.

For clarity, we remind the reader that in (5) we have Θ = (θ+, θ−, θ̄). The KKT conditions for (5) can then be succinctly written as follows:

    ∂Q(Θ)/∂θ+_ij + λ ∂h(θ+ + θ−)/∂θ+_ij = M+_ij ≥ 0,   (6a)
    ∂Q(Θ)/∂θ−_ij + λ ∂h(θ+ + θ−)/∂θ−_ij = M−_ij ≥ 0,   (6b)
    θ+_ij M+_ij = θ−_ij M−_ij = 0,   (6c)
    ∇_θ̄ Q(Θ) = 0,   (6d)

in addition to the feasibility conditions in (5), where M = (M+, M−) and λ are the Lagrange multipliers of the constraints on θ± and h, respectively; here λ ∈ R and M ≥ 0. Briefly, (6a), (6b) and (6d) result from dual feasibility and the stationarity condition, while (6c) stems from complementary slackness. The following useful theorem from Wei et al. (2020) establishes the connection between KKT satisfiability in (5) and local minimality in (4) for linear SEMs (i.e., θ̄ = ∅).

Theorem 1 (Wei et al., 2020, Theorem 7). Assume that Q is convex, h satisfies Condition 1, and θ̄ = ∅. If (θ+, θ−) satisfies the KKT conditions in (6), then θ+ − θ− is a local minimum of (4).

A key ingredient of our developments in the sequel is the following alternative characterization of the KKT conditions, which turns out to provide an algorithmically amenable first-order sufficient condition for local optimality. We include a proof in Appendix D.

Lemma 1. Suppose Θ = (θ+, θ−, θ̄) satisfies the following conditions:

(i) θ+_ij = θ−_ij = 0 for every (i, j) with [∇h(θ+ + θ−)]_ij > 0;
(ii) ∂Q(Θ)/∂θ_ij = 0 for every (i, j) with [∇h(θ+ + θ−)]_ij = 0;
(iii) ∇_θ̄ Q(Θ) = 0;
(iv) θ+ ≥ 0, θ− ≥ 0.

Then Θ is a KKT point of (5). Moreover, if the score Q is convex, any such Θ = (θ+ − θ−, θ̄) is also a local minimum of problem (4).

Remark 3. If the score Q is smooth but non-convex, then we can no longer use Lemma 1 to automatically promote KKT points to local minima. Thus, in the sequel, whenever the score Q is non-convex, all claims about local minima must be demoted to KKT points.
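As an illustration of how Lemma 1 can be checked numerically, the following sketch treats the linear special case (θ = W, θ̄ = ∅) with the NOTEARS h and the least-squares score; the helper names and the tolerance-based comparison are our own choices, not part of the paper.

```python
import numpy as np
from scipy.linalg import expm

def grad_h_notears(B):
    # ∇_B [tr(e^B) - d] = (e^B)^T
    return expm(B).T

def grad_Q_least_squares(W, X):
    # Q(W) = 1/(2n) ||X - XW||_F^2  =>  ∇Q(W) = -X^T (X - XW) / n
    n = X.shape[0]
    return -X.T @ (X - X @ W) / n

def lemma1_violations(W, X, tol=1e-8):
    G_h = grad_h_notears(np.abs(W))
    G_Q = grad_Q_least_squares(W, X)
    # (i): wherever [∇h(|W|)]_ij > 0 we need W_ij = 0
    viol_i = list(zip(*np.nonzero((G_h > tol) & (np.abs(W) > tol))))
    # (ii): wherever [∇h(|W|)]_ij = 0 we need ∂Q/∂W_ij = 0
    viol_ii = list(zip(*np.nonzero((G_h <= tol) & (np.abs(G_Q) > tol))))
    # Both lists empty (together with feasibility) certifies a KKT point of (5).
    return viol_i, viol_ii
```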
4. Optimization Algorithm: Topological Swaps

Our key idea is to solve (4) as a two-stage problem: in the inner stage, we solve (4) subject to an additional constraint that makes the problem tractable, and in the outer stage, we search over the set of such constraints. The critical innovation is in using our reformulation of the KKT conditions to guide this search. Our specific set of constraints relies on imposing an ordering over the variables. We briefly review such order-constrained optimization below before formally introducing our overall approach.

4.1. Background: Order-constrained optimization

We leverage the following well-known observation: for a fixed topological sort, problem (4), or equivalently (5), can often be solved efficiently. We briefly review this material here for completeness. Recall that a topological sort (or order) ⪯ for G is an ordering of the vertex set V = [d] such that X_i → X_j implies X_i ⪯ X_j, where X_i → X_j means there exists an edge from node i to node j. A directed graph is acyclic if and only if it has a topological sort, although this sort may not be unique. Equivalently, we can view a topological sort as a permutation of V.

Definition 1 (Topological sort). A topological sort ⪯ defines a permutation π of the vertex set V of G by letting π(j) be the j-th node in the ordering defined by ⪯. In other words, if X_{π(i)} → X_{π(j)}, then i < j. A similar definition carries over in the obvious way to weighted adjacency matrices θ. We furthermore call G (resp. θ) consistent with π if π is a topological sort of G (resp. θ), and write this as G ∼ π (resp. θ ∼ π).

Given a permutation π, we then have the following order-constrained optimization problem:

    min_{θ ∼ π} Q(Θ).   (7)

Due to the order-consistency constraint θ ∼ π, the acyclicity constraint h(|θ|) = 0 is automatically satisfied and hence can be omitted from (7). We next reformulate (7) with explicit linear constraints; moreover, in the sequel, we use Θ*_π to denote any solution to this problem:

    Θ*_π = (θ*_π, θ̄*_π) ∈ arg min_Θ Q(Θ)   (8)
    subject to  θ_{π(i),π(j)} = 0 for all j < i.

Remark 4. Our results only require solving (8) up to stationarity. That is, we can first set θ_{π(i),π(j)} = 0 for all j < i, and then use any off-the-shelf first-order optimizer (Boyd et al., 2004; Nesterov et al., 2018) on the resulting (non)convex unconstrained problem.

4.2. Algorithm

Motivated by the observations above, we propose a general bi-level algorithm based on finding the topological sort π of an optimal scoring DAG. For any Θ and τ, ξ ≥ 0, define the set

    Y(Θ, τ, ξ) := { (i, j) : [∇h(|θ|)]_ij ≤ τ and |∂Q(Θ)/∂θ_ij| > ξ }.   (9)

Given this machinery, the four main steps of our approach (Algorithm 1) are as follows:

1. Initialize at an arbitrary sort π and solve (7).
2. Define a candidate set of possible swaps Y(Θ*_π, τ*, ξ*) as in (9), where (τ*, ξ*) are parameters chosen adaptively such that |Y(Θ*_π, τ*, ξ*)| ≈ s_small.
3. Choose the best swap from this set to obtain a new topological sort, i.e., the swap that decreases the score Q the most.
4. Repeat until there is no sufficient improvement in the score.

There are several advantages to this approach. Enforcing acyclicity is much simpler: once a topological sort is fixed, acyclicity is automatically guaranteed and the optimization is straightforward and efficient (cf. Section 4.1). Thus, there is no need to include h(|θ|) directly in the optimization routines, in contrast to Zheng et al. (2018), which greatly simplifies implementation. Moreover, we only need to check (ii), (iii), and (iv) in Lemma 1 in order to ensure that the KKT conditions are satisfied, and computing the gradients ∇Q and ∇h is easy; condition (i) serves to ensure that |θ| is acyclic, which is always satisfied by the argument above. It is also worth stressing that this is not the same as greedily selecting individual edges as in GES (Chickering, 2003): each swap re-solves (7) globally, and hence updates every edge. Crucially, in the second step, it is not necessary to exhaustively check all possible swaps: by properly exploiting the KKT conditions as in Lemma 1, we are able to limit the set of possible candidate swaps to Y(Θ*_π, τ*, ξ*). This greatly improves the efficiency of the algorithm. Finally, it is not necessary to find the swap that decreases the score the most in line 9 of Algorithm 1; instead, any swap that decreases Q could be used to accelerate our algorithm. This greedy strategy, which is explored in the appendices, can improve time efficiency while attaining comparable performances.
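The following minimal sketch illustrates the two building blocks above for the linear least-squares case: the order-constrained solve (8), which reduces to one regression per node, and the candidate-swap set Y(Θ, τ, ξ) of eq. (9) with the NOTEARS h. Function names are ours, not taken from the released code.

```python
import numpy as np
from scipy.linalg import expm

def solve_order_ls(X, pi):
    """Solve (8) for the least-squares score: regress each node on the nodes that
    precede it in the topological sort pi (closed form, no acyclicity term needed)."""
    n, d = X.shape
    W = np.zeros((d, d))
    for k, j in enumerate(pi):
        parents = list(pi[:k])             # only earlier nodes may be parents
        if parents:
            coef, *_ = np.linalg.lstsq(X[:, parents], X[:, j], rcond=None)
            W[parents, j] = coef
    return W

def candidate_swaps(W, X, tau, xi):
    """Y(Θ, τ, ξ): pairs with small acyclicity gradient but non-negligible score gradient."""
    n, d = X.shape
    G_h = expm(np.abs(W)).T                # [∇h(|W|)]_ij for the NOTEARS h
    G_Q = -X.T @ (X - X @ W) / n           # ∇Q(W) for least squares
    return [(i, j) for i in range(d) for j in range(d)
            if i != j and G_h[i, j] <= tau and abs(G_Q[i, j]) > xi]
```

Given Θ*_π = solve_order_ls(X, pi), an empty candidate_swaps(W, X, 0, 0) certifies the KKT conditions by Corollary 2 below.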
The main steps of our method are summarized in Algorithm 1; a more comprehensive outline (for reproducibility purposes) can be found in Appendix B (Algorithm 2). The subroutine FINDPARAMS (detailed in Algorithm 3 in Appendix B) aims to find appropriate values for τ and ξ such that |Y(Θ, τ, ξ)| ≈ s. In Algorithm 1, the notation s_small and s_large is used to denote small and large search spaces, respectively.

Remark 5. It is worth noting how the continuous formulation plays a critical role in Algorithm 1: we use both the KKT conditions and the function h in order to select candidate swaps (cf. (9)).

Algorithm 1 TOPO
Require: Initial topological sort π, integers s_small and s_large with s_large > s_small, and score function Q.
1: {Here we use π_ij to denote the new topological sort obtained by swapping nodes i and j in π.}
2: (τ*, ξ*) ← FINDPARAMS(θ*_π, s_small)
3: S ← Y(Θ*_π, τ*, ξ*)
4: while S ≠ ∅ do
5:   if ∃ (i, j) ∈ S s.t. Q(Θ*_{π_ij}) < Q(Θ*_π) then
6:     Update π to be the π_ij that (most) decreases Q.
7:     S ← Y(Θ*_π, τ*, ξ*)
8:   else
9:     (τ*, ξ*) ← FINDPARAMS(θ*_π, s_large)
10:    S ← Y(Θ*_π, τ*, ξ*)   {Try a larger search space}
11:    if ∃ (i, j) ∈ S s.t. Q(Θ*_{π_ij}) < Q(Θ*_π) then
12:      Update π to be the π_ij that (most) decreases Q.
13:      S ← Y(Θ*_π, τ*, ξ*)
14:    else
15:      S ← ∅
16:    end if
17:  end if
18: end while
Ensure: Θ*_π

4.3. Analysis

Intuitively, the idea behind Algorithm 1 is that it iteratively jumps between better and better local minimizers, until the candidate swaps given by (9) no longer offer any significant improvement in the score. This is achieved by exploiting the KKT conditions (6). In this section, we show that this is not just a heuristic: under appropriate conditions, Algorithm 1 indeed decreases the score and always terminates at a local minimum or KKT point. Before proving this, it is worth stressing why this is not obvious a priori: even if we solve (7) to global optimality (i.e., given the order constraint θ ∼ π), a global solution to (7) need not be a local solution to (4). This stems from the fact that a DAG can have more than one topological sort, and the solutions to (7) for each sort need not coincide. We begin with two important lemmas.

Lemma 2. If (i, j) ∈ Y(Θ*_π, 0, 0), then (θ*_π)_ij = 0.

Lemma 3. If the score Q is separable w.r.t. θ, i.e., Q(Θ) = Σ_j Q_j(θ_j, θ̄), and Y(Θ*_π, 0, 0) ≠ ∅ for some topological sort π, then Q(Θ*_{π_ij}) < Q(Θ*_π) for every (i, j) ∈ Y(Θ*_π, 0, 0).

Lemma 3 has an important takeaway message: as long as we can find a pair of nodes (i, j) ∈ Y(Θ*_π, 0, 0), i.e., Y(Θ*_π, 0, 0) ≠ ∅, then we can find another topological sort with strictly smaller score. The difficult case is when Y(Θ*_π, 0, 0) = ∅: what Algorithm 1 does is increase the thresholds (τ*, ξ*) just enough to make Y(Θ*_π, τ*, ξ*) ≠ ∅. Guided by the previous observation, this suggests that placing node i before node j is likely (but not guaranteed) to decrease the score. There are many strategies for updating the topological sort to make this happen, but we adopt the simplest one, namely swapping node i and node j. The preceding discussion can be made more concrete via the following observation:

Corollary 2. If Y(Θ*_π, 0, 0) = ∅, then Θ*_π satisfies the KKT conditions in (6).
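To make the outer loop concrete, here is a simplified sketch for the linear least-squares case. It reuses solve_order_ls and candidate_swaps from the previous sketch and omits the adaptive FINDPARAMS step and the small/large search spaces, so it should be read as an illustration of the idea rather than the full Algorithm 1.

```python
import numpy as np

def score_ls(W, X):
    n = X.shape[0]
    return np.sum((X - X @ W) ** 2) / (2 * n)

def topo_search(X, pi, tau=1e-6, xi=1e-4, max_iter=100):
    pi = list(pi)
    W = solve_order_ls(X, pi)
    for _ in range(max_iter):
        best_score, best_pi, best_W = score_ls(W, X), None, None
        for (i, j) in candidate_swaps(W, X, tau, xi):
            pi_ij = list(pi)
            a, b = pi_ij.index(i), pi_ij.index(j)
            pi_ij[a], pi_ij[b] = pi_ij[b], pi_ij[a]   # swap nodes i and j in the order
            W_ij = solve_order_ls(X, pi_ij)
            s = score_ls(W_ij, X)
            if s < best_score:
                best_score, best_pi, best_W = s, pi_ij, W_ij
        if best_pi is None:                           # no candidate swap improves the score
            break
        pi, W = best_pi, best_W
    return pi, W
```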
The following definition relates to the score Q and is a relevant property for Theorem 2.

Definition 2 (Connected estimator). Given a topological sort π, the estimator Θ*_π is called connected if for any i < j there is a directed path from node π(i) to node π(j) in θ*_π. Equivalently, for any i < j, a connected estimator satisfies [∇h(|θ*_π|)]_{π(j),π(i)} > 0.

In general, we expect an estimator to be connected when sparse regularization is not used. It is worth noting that NOTEARS (Zheng et al., 2018) without explicit ℓ1 regularization is observed to return a connected estimator.

Theorem 2. Let h be any function satisfying Condition 1, and assume that the score Q is convex (resp. non-convex) and that Θ*_π is connected for all π. Then Algorithm 1 returns a local minimum (resp. KKT point) of problem (4), and the score is decreased at each iteration. Moreover, the solution at each iteration is also a local minimum (resp. KKT point).

Remark 6. Although the proof of Theorem 2 is deceptively simple, we stress that it is not a priori obvious that swapping pairs of nodes will always decrease the score: done naïvely, this could increase the score. Our careful use of the KKT conditions precludes this behavior.

The connected estimator assumption in Theorem 2 can be dropped whenever the score Q is separable (e.g., least squares).

Theorem 3. Let h be any function satisfying Condition 1, and assume that the score Q is separable w.r.t. θ, i.e., Q(Θ) = Σ_j Q_j(θ_j, θ̄). If the score Q is convex (resp. non-convex), then Algorithm 1 returns a local minimum (resp. KKT point) of problem (4), and the score is decreased at each iteration.

4.4. Comparison to previous work

Wei et al. (2020) first unveiled the connections between the KKT conditions in (6) and local minimality in (4) by studying a related problem with explicit edge-absence constraints Z. As such, it is instructive to compare these two approaches since there are some important distinctions. A first clear difference is that the KKTS algorithm by Wei et al. (2020) relies on an assumption they call irreducibility to ensure local minimality. We provide a complete discussion of the irreducibility assumption of Wei et al. (2020) in Appendix C.2 and focus on the main ideas here. Briefly, KKTS (Wei et al., 2020) uses a set of node pairs Z to indicate which edges should be absent in the graph. KKTS works by iteratively adding and removing elements of Z, and the algorithm stops once Z is an irreducible set. Wei et al. (2020) then show that when Z is irreducible, the KKTS solution is a local minimum, provided additional assumptions such as the score being separable and convex. In Appendix C.2, we show that irreducibility is not a necessary condition for optimality. We prove this by exhibiting a simple example where an optimal solution corresponds to a reducible set Z.

Proposition 1. Irreducibility of the set Z is sufficient but not necessary for KKTS to find a KKT point of problem (5).

The above discussion already marks a clear distinction between Algorithm 1 and KKTS: our method does not rely on the irreducibility assumption. Finally, we note that the irreducibility assumption might seem a mild condition; however, it can have a severe effect on the runtime of KKTS, as it will not stop until an irreducible set is found.

A second difference to KKTS is that our approach not only attempts to find an optimal solution but also attempts to find the local optimum with the lowest score possible. This fact is a direct consequence of how Algorithm 1 works, namely, at each iteration we look for a solution with lower score. The fact that KKTS does not use the score Q to guide its search procedure can result in solutions with high scores. We next provide more details; full details can be found in Appendix C.1.

Example 1. Consider the following three-node linear SEM with standard Gaussian noise z_j ∼ N(0, 1) for j ∈ [3].
Consider also that the score Q is the population least-squares loss, and

    X_1 = z_1,   X_2 = a·X_1 + z_2,   X_3 = b·X_2 + z_3.   (10)

In Appendix C.1, we show that for the linear model (10) there exist many values of a and b for which KKTS and NOTEARS produce solutions with higher score than Algorithm 1; moreover, NOTEARS produces non-optimal solutions. This is illustrated in Appendix C.1 for a = 1, b = 0.55. In each of these examples, our method always returns a solution that satisfies the optimality conditions in Lemma 1 and also attains the lowest score.

5. Experiments

We compare our method against state-of-the-art solvers for (1), namely, NOTEARS (Zheng et al., 2018; 2020), NOFEARS (KKTS) (Wei et al., 2020), and GOLEM (Ng et al., 2020). For TOPO (Algorithm 1), we consider the case of random initialization (denoted by names starting with RANDOM) and initialization at the output of NOTEARS (denoted by names starting with NOTEARS). Here, random initialization is conducted by sampling a topological sort π uniformly at random and solving problem (7) to get Θ*_π. Details for each experimental setting can be found in Appendix E. Our main empirical results are shown in Tables 1, 2, 3 and 4. In all the tables, we report: whether or not the solution of each algorithm satisfies the KKT conditions (1 indicating that the method always returned a KKT point, and 0 indicating that it never returned a KKT point); the score/loss attained by the method; and the structural Hamming distance (SHD) w.r.t. the ground-truth DAG.

Table 1. Experiments on linear DAGs with equal-variance Gaussian noise on ER4 graphs. The score is least squares, and d is the number of nodes. Our methods are RANDOM-TOPO and NOTEARS-TOPO.

Method       | Metric | d = 20      | d = 40      | d = 100
GOLEM-EV     | KKT    | 0           | 0           | 0
             | Loss   | 10.7 ± 0.12 | 40.7 ± 4.8  | 68.8 ± 3.9
             | SHD    | 11.4 ± 3.4  | 51.4 ± 28.3 | 145.2 ± 52.6
NOTEARS      | KKT    | 0           | 0           | 0
             | Loss   | 11.9 ± 0.1  | 62.1 ± 8.8  | 73.1 ± 7.6
             | SHD    | 28.6 ± 3.2  | 129 ± 25.5  | 140.0 ± 30.1
NOFEARS      | KKT    | 1           | 1           | 1
             | Loss   | 11.5 ± 0.3  | 47.6 ± 1.6  | 61.2 ± 2.6
             | SHD    | 23.2 ± 4.5  | 69.8 ± 16.0 | 87.5 ± 19.2
NOTEARS-TOPO | KKT    | 1           | 1           | 1
             | Loss   | 9.8 ± 0.1   | 38.4 ± 0.1  | 47.5 ± 0.1
             | SHD    | 0.4 ± 0.2   | 9.2 ± 0.8   | 14.2 ± 1.9
RANDOM-TOPO  | KKT    | 1           | 1           | 1
             | Loss   | 9.8 ± 0.1   | 38.4 ± 0.1  | 47.5 ± 0.1
             | SHD    | 0.4 ± 0.2   | 8.6 ± 0.9   | 16.3 ± 2.6

Table 2. Experiments on linear DAGs with unequal-variance Gaussian noise on ER4 graphs. The score is the log-likelihood with the minimax concave penalty (MCP), and d is the number of nodes. Our methods are RANDOM-TOPO and NOTEARS-TOPO.

Method       | Metric | d = 20     | d = 40     | d = 100
GOLEM-NV     | KKT    | 0          | 0          | 0
             | Loss   | 9.9 ± 0.6  | 15.2 ± 1.3 | 42.7 ± 3.5
             | SHD    | 2.3 ± 0.1  | 23.4 ± 3.4 | 82.1 ± 12.3
NOTEARS      | KKT    | 0          | 0          | 0
             | Loss   | 13.8 ± 2.1 | 17.2 ± 1.2 | 50.6 ± 4.5
             | SHD    | 7.3 ± 0.1  | 39.2 ± 7.1 | 138.1 ± 23.6
NOTEARS-TOPO | KKT    | 1          | 1          | 1
             | Loss   | 8.3 ± 1.2  | 13.2 ± 2.1 | 35.1 ± 2.3
             | SHD    | 2.7 ± 3.2  | 26.3 ± 4.2 | 86.9 ± 6.6
RANDOM-TOPO  | KKT    | 1          | 1          | 1
             | Loss   | 8.9 ± 1.3  | 14.4 ± 1.2 | 39.2 ± 4.1
             | SHD    | 3.3 ± 0.2  | 29.1 ± 4.2 | 106.4 ± 11.6

Table 3. Experiments on fully connected linear DAGs with Gaussian noise. The score is least squares, and d is the number of nodes. Our methods are RANDOM-TOPO and NOTEARS-TOPO.

Method       | Metric | d = 10     | d = 20          | d = 50
GOLEM-EV     | KKT    | 0          | 0               | 0
             | Loss   | 4.3 ± 0.1  | 4616.5 ± 4163.2 | (7.4 ± 7.4) × 10^18
             | SHD    | 6.5 ± 0.8  | 85.1 ± 6.4      | 1152.5 ± 2.6
NOTEARS      | KKT    | 0          | 0               | 0
             | Loss   | 6.2 ± 0.2  | 18.9 ± 1.3      | (1.7 ± 1.6) × 10^11
             | SHD    | 14 ± 1.1   | 79.5 ± 2.1      | 1198.1 ± 5.3
NOTEARS-TOPO | KKT    | 1          | 1               | 1
             | Loss   | 4.97 ± 0.1 | 9.92 ± 0.1      | 24.9 ± 0.3
             | SHD    | 0.1 ± 0.1  | 0.7 ± 0.2       | 19.4 ± 5.5
RANDOM-TOPO  | KKT    | 1          | 1               | 1
             | Loss   | 4.97 ± 0.1 | 10.3 ± 0.2      | 35.8 ± 2.1
             | SHD    | 0.1 ± 0.1  | 3.1 ± 1.4       | 155.6 ± 17.9
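For reference, the SHD values reported in Tables 1-4 can be computed as in the following sketch; this is a common convention (a reversed edge counts as a single error, missing and extra edges count one each) and is our assumption, since the excerpt above does not spell out the exact convention used in the experiments.

```python
import numpy as np

def shd(B_est, B_true):
    # Binary edge-presence matrices of the estimated and true graphs.
    E = (np.asarray(B_est) != 0).astype(int)
    T = (np.asarray(B_true) != 0).astype(int)
    diff = np.abs(E - T)                           # 1 wherever edge presence disagrees
    # An edge present in both graphs but with opposite orientation shows up twice in
    # diff; count each such reversal only once.
    reversed_pairs = np.triu(np.minimum(diff, diff.T), k=1)
    return int(diff.sum() - reversed_pairs.sum())
```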
In Table 1, we observe that, as expected, NOFEARS and our algorithm are capable of returning a KKT point (and a local minimum in this setting, since Q is convex). We also note that TOPO with random initialization (RANDOM-TOPO) performs competitively in this case, even though the initial topological sort was randomly sampled. Moreover, notice that when initialized at the output of NOTEARS, our method (NOTEARS-TOPO) improves the performance of NOTEARS dramatically. The latter demonstrates the usability of our method as a post-processing algorithm, as discussed in our contributions.

In Table 2, for a non-convex score, we observe that TOPO still obtains solutions satisfying KKT optimality and achieves the lowest scores.

In Table 3, we study a very challenging setting where the underlying graph is a fully connected DAG. We observe that existing methods can perform reasonably well when the number of nodes is very small (e.g., 10), but their performance degrades severely for graphs with a larger number of nodes. In contrast, TOPO works remarkably well in this setting, which should come as no surprise since sparsity assumptions are not required, consistent with our analysis in Section 4.3.

Table 4. Experiments on nonlinear models with neural networks on ER4 graphs. The score is least squares, and d is the number of nodes. Our method is NOTEARS-TOPO. Here TRUE denotes the solution of problem (8) using the underlying true topological sort.

Method       | Metric | d = 10    | d = 20     | d = 40
NOTEARS-MLP  | KKT    | 0         | 0          | 0
             | Loss   | 7.2 ± 0.2 | 14.4 ± 0.3 | 28.5 ± 0.4
             | SHD    | 5.6 ± 0.7 | 29.1 ± 3.1 | 112.3 ± 20.2
NOTEARS-TOPO | KKT    | 1         | 1          | 1
             | Loss   | 6.4 ± 0.1 | 11.6 ± 0.1 | 22.8 ± 0.6
             | SHD    | 2.7 ± 0.5 | 12.1       | 36.3 ± 20.4
TRUE         | KKT    | 1         | 1          | 1
             | Loss   | 6.3 ± 0.1 | 12.2 ± 0.1 | 23.4 ± 0.4
             | SHD    | 2.1 ± 0.5 | 11.6 ± 0.6 | 36.1 ± 2.2

In Table 4, we make an explicit comparison to nonlinear NOTEARS (Zheng et al., 2020); comparison against other methods is implicit in previous work (Zheng et al., 2020). We observe that NOTEARS-TOPO outperforms all other methods and is close to the solution of problem (8) using the true topological sort.

5.1. Additional experiments

In Appendix E, we provide further experiments. We consider linear models with different noise distributions (e.g., Gaussian, Gumbel and exponential) for {ER1, ER2, ER4, SF1, SF2, SF4} graphs; see Appendix E.4. There, we observe that our methods, even with random initialization, still outperform existing methods in terms of score and SHD, and the solutions are guaranteed to be local minima. Additionally, our results are not specific to certain non-linearities. To illustrate this, we run experiments on a logistic model (binary X_j) and on neural networks; see details in Appendix E.5. Finally, we also report the runtime and scores of each method for linear and nonlinear models in Appendices E.4 and E.5.

We analyze the sensitivity of Algorithm 1 to the hyperparameters s_small and s_large (see Appendix E.3). That is, we thoroughly study the effect of the hyperparameters on the cardinality of the search spaces, see eq. (9), and on how many times our algorithm searches in a space of large cardinality. Moreover, we test our method when using a randomly chosen swapping set to demonstrate the effectiveness of (9) (see Appendix E.6). Finally, we also include an analysis of structural accuracy versus iterations to track the performance of Algorithm 1 (see Appendix E.7).
6. Conclusion

Inspired by the KKT conditions, we developed new insights into the optimization-theoretic properties of NOTEARS objectives, and proposed a new bi-level algorithm with attractive local optimality guarantees. As a by-product, it can also improve the solutions of state-of-the-art solvers for (1) (e.g., NOTEARS, KKTS, GOLEM). Although proving convergence to a global minimizer is expected to be challenging, we have shown that our method has desirable properties for an optimization scheme: (a) it decreases the score in each iteration, and (b) it is guaranteed to return a local minimizer (and hence also a KKT point). The key driver behind our approach is the interpretation of the KKT conditions as a proxy for choosing promising node swaps in a topological sort. An important open question for future work is the convergence of Algorithm 1: what is its iteration and computational complexity? It is also interesting to note that, unlike previous methods that rely on explicitly enforcing acyclicity via h(B), our approach only uses h(B) indirectly in order to check the KKT conditions. This idea was already implicit in the KKTS method due to Wei et al. (2020), and could lead to new insights into how to optimize NOTEARS objectives and other acyclicity-constrained problems.

Acknowledgments and Disclosure of Funding

K.B. was supported by NSF under Grant # 2127309 to the Computing Research Association for the CIFellows 2021 Project. B.A. was supported by NSF IIS-1956330, NIH R01GM140467, and the Robert H. Topel Faculty Research Fund at the University of Chicago Booth School of Business. This work was done in part while B.A. was visiting the Simons Institute for the Theory of Computing. P.R. was supported by ONR via N000141812861, and NSF via IIS-1909816, IIS-1955532, IIS-2211907. We also thank the University of Chicago Research Computing Center for assistance with the calculations carried out in this work.

References

Aragam, B. and Zhou, Q. Concave penalized estimation of sparse Gaussian Bayesian networks. The Journal of Machine Learning Research, 16(1):2273-2328, 2015.

Aragam, B., Amini, A., and Zhou, Q. Globally optimal score-based learning of directed acyclic graphs in high-dimensions. Advances in Neural Information Processing Systems, 32, 2019.

Barabási, A.-L. and Albert, R. Emergence of scaling in random networks. Science, 286(5439):509-512, 1999.

Bello, K., Aragam, B., and Ravikumar, P. K. DAGMA: Learning DAGs via M-matrices and a log-determinant acyclicity characterization. In Advances in Neural Information Processing Systems, 2022.

Bertsekas, D. P. Nonlinear programming. Journal of the Operational Research Society, 48(3):334-334, 1997.

Boyd, S., Boyd, S. P., and Vandenberghe, L. Convex optimization. Cambridge University Press, 2004.

Bühlmann, P., Peters, J., and Ernest, J. CAM: Causal additive models, high-dimensional order search and penalized regression. The Annals of Statistics, 42(6):2526-2556, 2014.

Chickering, D. M. Optimal structure identification with greedy search. JMLR, 3:507-554, 2003.

Cussens, J. Bayesian network learning with cutting planes. arXiv preprint arXiv:1202.3713, 2012.

Drton, M., Chen, W., and Wang, Y. S. On causal discovery with equal variance assumption. arXiv preprint arXiv:1807.03419, 2018.

Geiger, D. and Heckerman, D. Parameter priors for directed acyclic graphical models and the characterization of several probability distributions. Annals of Statistics, 30:1412-1440, 2002.

Ghoshal, A. and Honorio, J.
Learning identifiable gaussian bayesian networks in polynomial time and sample complexity. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6460 6469, 2017. Ghoshal, A. and Honorio, J. Learning linear structural equation models in polynomial time and sample complexity. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pp. 1466 1475. PMLR, 2018. Optimizing NOTEARS Objectives via Topological Swaps Jaakkola, T., Sontag, D., Globerson, A., and Meila, M. Learning bayesian network structure using lp relaxations. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pp. 358 365, 2010. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Lachapelle, S., Brouillard, P., Deleu, T., and Lacoste-Julien, S. Gradient-based neural dag learning. In International Conference on Learning Representations, 2020. Loh, P.-L. and Bühlmann, P. High-dimensional learning of linear causal networks via inverse covariance estimation. The Journal of Machine Learning Research, 15(1):3065 3105, 2014. Margaritis, D. Learning bayesian network model structure from data. Technical report, Carnegie-Mellon Univ Pittsburgh Pa School of Computer Science, 2003. Margaritis, D. and Thrun, S. Bayesian network induction via local neighborhoods. In Proceedings of the 12th International Conference on Neural Information Processing Systems, pp. 505 511, 1999. Nandy, P., Hauser, A., and Maathuis, M. H. Highdimensional consistency in score-based and hybrid structure learning. The Annals of Statistics, 46(6A):3151 3183, 2018. Nesterov, Y. et al. Lectures on convex optimization, volume 137. Springer, 2018. Ng, I., Ghassami, A., and Zhang, K. On the role of sparsity and dag constraints for learning linear DAGs. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 17943 17954. Curran Associates, Inc., 2020. Ng, I., Lachapelle, S., Ke, N. R., Lacoste-Julien, S., and Zhang, K. On the convergence of continuous constrained optimization for structure learning. In International Conference on Artificial Intelligence and Statistics, pp. 8176 8198. PMLR, 2022. Park, Y. W. and Klabjan, D. Bayesian network learning via topological order. The Journal of Machine Learning Research, 18(1):3451 3482, 2017. Peters, J. and Bühlmann, P. Identifiability of gaussian structural equation models with equal error variances. Biometrika, 101(1):219 228, 2014. Peters, J., Mooij, J. M., Janzing, D., and Schölkopf, B. Causal discovery with continuous additive noise models. JMLR, 2014. Peters, J., Janzing, D., and Schölkopf, B. Elements of causal inference: foundations and learning algorithms. MIT press, 2017. Raskutti, G. and Uhler, C. Learning directed acyclic graph models based on sparsest permutations. Stat, 7(1):e183, 2018. Reisach, A., Seiler, C., and Weichwald, S. Beware of the simulated dag! causal discovery benchmarks may be easy to game. Advances in Neural Information Processing Systems, 34:27772 27784, 2021. Romain, M. and d Aspremont, A. A bregman method for structure learning on sparse directed acyclic graphs. ar Xiv preprint ar Xiv:2011.02764, 2020. Scanagatta, M., de Campos, C. P., Corani, G., and Zaffalon, M. 
Learning bayesian networks with thousands of variables. In NIPS, pp. 1864 1872, 2015. Silander, T. and Myllymaki, P. A simple approach for finding the globally optimal bayesian network structure. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, 2006. Solus, L., Wang, Y., and Uhler, C. Consistency guarantees for greedy permutation-based causal inference algorithms. Biometrika, 108(4):795 814, 2021. Spirtes, P. and Glymour, C. An algorithm for fast recovery of sparse causal graphs. Social Science Computer Review, 9(1):62 72, 1991. Squires, C., Wang, Y., and Uhler, C. Permutation-based causal structure learning with unknown intervention targets. In Conference on Uncertainty in Artificial Intelligence, pp. 1039 1048. PMLR, 2020. Teyssier, M. and Koller, D. Ordering-based search: A simple and effective algorithm for learning bayesian networks. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, UAI 05, pp. 584 590, Arlington, Virginia, USA, 2005. AUAI Press. ISBN 0974903914. Tsamardinos, I., Aliferis, C. F., Statnikov, A. R., and Statnikov, E. Algorithms for large scale markov blanket discovery. In FLAIRS conference, volume 2, pp. 376 380, 2003. Tsamardinos, I., Brown, L. E., and Aliferis, C. F. The maxmin hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1):31 78, 2006. Optimizing NOTEARS Objectives via Topological Swaps Van de Geer, S. and Bühlmann, P. ℓ0-penalized maximum likelihood for sparse directed acyclic graphs. The Annals of Statistics, 41(2):536 567, 2013. Wang, X., Du, Y., Zhu, S., Ke, L., Chen, Z., Hao, J., and Wang, J. Ordering-based causal discovery with reinforcement learning. ar Xiv preprint ar Xiv:2105.06631, 2021. Wei, D., Gao, T., and Yu, Y. DAGs with no fears: A closer look at continuous optimization for learning bayesian networks. In Advances in Neural Information Processing Systems, 2020. Xiang, J. and Kim, S. A* lasso for learning a sparse bayesian network structure for continuous variables. Advances in neural information processing systems, 26, 2013. Ye, Q., Amini, A. A., and Zhou, Q. Optimizing regularized cholesky score for order-based learning of bayesian networks. IEEE transactions on pattern analysis and machine intelligence, 43(10):3555 3572, 2020. Yu, Y., Chen, J., Gao, T., and Yu, M. Dag-gnn: Dag structure learning with graph neural networks. In International Conference on Machine Learning, pp. 7154 7163. PMLR, 2019. Zheng, X., Aragam, B., Ravikumar, P. K., and Xing, E. P. DAGs with NO TEARS: Continuous optimization for structure learning. In Advances in Neural Information Processing Systems, 2018. Zheng, X., Dan, C., Aragam, B., Ravikumar, P., and Xing, E. Learning sparse nonparametric DAGs. In International Conference on Artificial Intelligence and Statistics, pp. 3414 3425. PMLR, 2020. Zhu, S., Ng, I., and Chen, Z. Causal discovery with reinforcement learning. In International Conference on Learning Representations, 2020. SUPPLEMENTARY MATERIAL Optimizing NOTEARS Objectives via Topological Swaps A. Additional Discussion on Family of Approximators Let F = {f : fj Fj, j [d]} be a family of functions used to approximate the SCM in problem (2). In this section, we focus on the general case and discuss under what conditions that family F can be used to approximate fj and how our results apply in this general setting. We consider approximations f = (f1, . . . , fd) F that are parameterized by θ, i.e. f(x) := f(x; θ). 
This defines W(θ) := W(f( ; θ)) as the adjacency matrix defined by (3), which is characterized by θ. Although the following definition is standard, we pause to make this precise since it is crucial in the development that follows: Definition 3 (sub-vector). Given a vector β = (β1, . . . , βn) Rn, and we say that α is a sub-vector of β if and only if there is a subset J = {j1, . . . , jk} {1, 2, . . . , n} such that α = βJ := (βj1, . . . , βjk). Under the following general assumptions, our results and proof in Section D still apply without any modification: (i) The parametrization is separable in the following sense: θ = (θ1, . . . , θd) and each fj in (2) is only parameterized by the sub-vector θj, i.e., fj(x; θ) = fj(x; θj). (ii) There are sub-vectors θij of θj that can reveal if there is no edge from node i to node j (i.e., [W(θ)]ij = 0 if and only if θij = 0.) In this case, the general definition in (3) can be replaced with [W(θ)]ij = θij 1 without loss of generality. Write θij = (θijr)r for each sub-vector θij. Since [W(θ)]ij = θij 1, we have [W(θ+ + θ )]ij = θ+ ij + θ ij 1 = 1 (θ+ ij + θ ij) = X r (θ+ ijr + θ ijr) Therefore, [W(θ+ + θ )]ij [W(θ+ + θ )]ij r = (1)r = 1. In Section 3.4, the KKT conditions for (5) involve the term h(W (θ++θ )) θ ij . By the assumptions above, we see that h(W(θ+ + θ )) is a function of θij through [W(θ+ + θ )]ij, so that by the chain rule we have h(W(θ+ + θ )) θ ij = h(W(θ+ + θ )) [W(θ+ + θ )]ij [W(θ+ + θ )]ij θ ij = [ h(W(θ+ + θ ))]ij 1 = [ h(W(|θ|))]ij1, this equality is crucial to Lemma 1. We conclude by discussing three important special cases that satisfy the assumptions above: (1) Linear SEMs, (2) Multilayer perceptrons (MLPs), and (3) Basis expansions. Linear SEMs. A linear SEM follows the following set of equations: Xj = fj(X, zj) = w j X + zj, wj Rd, j [d], where zj R represents the noise following any distribution. Let W = [w1 | w2 | | wd] Rd d. In this case, all the model parameters are θ = W. The parameters related to node j are θj = wj, thus, each function fj is only characterized by θj. Thus, condition (i) above is clearly satisfied. Furthermore, we have θij = Wij (the (i, j)-th entry of W), where clearly there is no edge from node i to node j if and only if Wij = 0. Therefore, condition (ii) above is also satisfied. Optimizing NOTEARS Objectives via Topological Swaps Multilayer perceptrons (MLPs). Let a multilayer perceptron (MLP) with h hidden layers and a single activation σ : R R be given by: MLP(X; A(1), . . . , A(h)) = σ(A(h)σ(. . . A(2)σ(A(1)x))), A(ℓ) Rmℓ mℓ 1, m0 = d, mh = 1. Then the nonlinear SCM with additive noise can be written as: Xj = fj(X, zj) = MLP(X; A(1) j , . . . , A(h) j ) + zj, where zj N(0, 1). Let θj = (A(1) j , . . . , A(h) j ) denote the parameters for the j-th MLP, and let θ = (θ1, . . . , θd) denote all model parameters. Define θij to be the i-th column of A(1) j . Since MLP(X; A(1) j , . . . , A(h) j ) is independent of Xi if and only if θij = 0 (e.g., Zheng et al., 2020, Proposition 1), we can define [W(θ)]ij = θij 1. Then, in this case it is easy to check that conditions (i) and (ii) above are satisfied. Basis expansion. As an alternative to neural networks, we also consider the use of orthogonal basis expansions, as in (Zheng et al., 2020). Let {φr} r be an orthonormal basis of functions such that E[φr(X)] = 0 for each r and r=1 αrφr(x), αr = Z Rd φr(x)f(x)dx. Consider additive models and one-dimensional expansions as follows: Xj = fj(X, zj) = X i =j fij(Xi) + zj = X r=1 αijrφr(xi) + zj. 
In this case, we let θ = (αijr)i,j,r denote all model parameters, θj = (αijr)ir denote all parameters related to node j, and θij = (αijr)r denote the parameters that model the absence of an edge from node i to node j. Additionally, set [W(θ)]ij = θij 1 = P r|αijr|. Similarly, it is easy to check that conditions (i) and (ii) above are both satisfied. B. Algorithm Details B.1. Full Algorithm Description A full and reproducible outline of Algorithm 1 can be found in Algorithm 2. Note that Algorithms 4 (UPDATESORT) and 3 (FINDPARAMS) are subroutines used by Algorithm 2. B.2. Additional Details on Hyperparameters In this section, we describe more details of the proposed order-based search method in Algorithm 2. This involves initializing the number of swapping pairs ssmall to define a small search space, the number of swapping pairs slarge to define a large search space, and the maximum number of searches s0 to perform in the large swapping-pairs space. From line 31 to 32, for each iteration, Algorithm 2 invokes Algorithm 3 to find out the best values for τ , ξ , τ , ξ to control the number of swapping pairs in the search space. For reference, the predefined T and Ξ that we used in Algorithm 3 are given in Table 5. By tuning ssmall and slarge, we can control Algorithm 2 to quickly converge to a local minimum, or have the chance to escape to a better local minimum in case it finds a suboptimal one. Specifically, in Line 4, if the running solution θ π does not satisfy the KKT conditions, then, as prescribed by Lemma 3, a better topological sort can be found. In Line 10, although the running solution θ π satisfies the KKT conditions, we use strict positive values for τ , and ξ to expand the search space Y and consider potential swapping pairs that can lead us to a better local minimum. Here, the parameter s0 specifies how many times the algorithm can search in a large swapping-pairs space, with the goal to escape from a region of bad local minima. These hyperparameters are determinant to Algorithm 2 performance and are tuned to balance the trade-off between accuracy and efficiency. Remark 7. One merit of Algorithm 2 over prior work is that it does not only search for a local minimum but also tries to escape from bad local minima. Thus, our algorithm usually attains the best scores among all DAG learning Optimizing NOTEARS Objectives via Topological Swaps Algorithm 2 TOPO Require: Given a topological sort π, two predefined numbers of swapping pairs ssmall, slarge, number of search in large space s0 and initialize corresponding Zπ. Solve (7) to get Θ π, set k 0 and count 0. 1: (τ , ξ ) FINDPARAMS(Θ π, ssmall) 2: (τ , ξ ) FINDPARAMS(Θ π, slarge). 3: while Y(Θ π, τ , ξ ) = do 4: if Y(Θ π, 0, 0) = then 5: Y Y(Θ π, 0, 0) 6: for (i, j) Y do 7: πij UPDATESORT(θ π, (i, j), opt = 2). 8: Solve (7) to obtain Θ πij. 9: end for 10: else 11: Y Y(Θ π, τ , ξ ) 12: for (i, j) Y do 13: πij UPDATESORT(θ π, (i, j), opt = 1). 14: Solve (7) to obtain Θ πij. 15: end for 16: end if 17: if min(i,j) Y Q(Θ πij) Q(Θ π) then 18: update π := arg minπij Q(Θ πij) 19: else 20: if k < s0 then 21: Y Y(Θ π, τ , ξ ) = 22: for (i, j) Y do 23: πij UPDATESORT(θ π, (i, j), opt = 1). 24: Solve (7) to obtain Θ πij. 
25: end for 26: if min(i,j) Y Q(Θ πij) Q(Θ π) then 27: update π := arg minπij Q(Θ πij) , k k + 1 28: else 29: Return Θ π then break 30: end if 31: else 32: Return Θ π then break 33: end if 34: end if 35: Solve (7) to obtain Θ π 36: Update Y(Θ π, τ , ξ ) 37: count count + 1 38: (τ , ξ ) FINDPARAMS(Θ π, ssmall) 39: (τ , ξ ) FINDPARAMS(Θ π, slarge). 40: end while Algorithm 3 FINDPARAMS Require: Parameter Θ, integer q that controls the size of the search space. 1: Create a predefined T = {τ1, . . . , τm} and Ξ = {ξ1, . . . , ξl}. Ensure: (τ, ξ) = arg minτ T,ξ Ξ |q |Y(Θ, τ, ξ)|| algorithms when comparable. A drawback of Algorithm 2 is the dependence on how large the search space is, which could be computationally intensive. Finally, Algorithm 2 solves (13) repeatedly, whose runtime also heavily determines the efficiency of Algorithm 2. See Section E for runtime comparisons against existing methods. Optimizing NOTEARS Objectives via Topological Swaps Algorithm 4 UPDATESORT Require: Parameter θ or topological sort π, (i, j), opt. Initialize predefined ϵ 10 8 (small). 1: if opt = 1 then 2: Swap nodes i and j, and denote the new topological sort by πij 3: else 4: θ θ 5: θ ij θij ϵ Q(Θ) θij 6: Find the topological sort of W(|θ |), denoted as πij. 7: end if Ensure: πij Parameters Values T 0 1 10 8 1 10 7 1 10 6 1 10 5 1 10 4 1 10 3 Ξ 0 1 10 7 1 10 6 5 10 6 1 10 5 5 10 5 1 10 4 5 10 4 1 10 3 5 10 3 1 10 2 5 10 2 1 10 1 5 10 1 1 2 5 10 15 20 40 Table 5. Suggested values for parameters T and Ξ in Algorithm 3. C. Irreducibility and Comparison with Prior Work C.1. Three-Node Example where KKTS and NOTEARS Fail In this section, we expand on Example 1. In particular, we show that our example was not handpicked but instead there exists several values a and b where the solutions from KKTS and NOTEARS either are DAGs with incorrect structure, or are non-optimal solutions, or both. Recall that our example follows the following SEM: X2 = a X1 + z2, (11) X3 = b X2 + z3, where zi N(0, 1) for i [3]. For the purposes of this analysis we consider the class of SEMs such that a2 > b2. Then, the true adjacent matrix and topological sort are: 0 a 0 0 0 b 0 0 0 , πtrue = [1, 2, 3]. Letting X = . The SEM (11) in vector form can be written as: X = W T true X + Z. We will use the population least square (LS) as the score function, which is defined as follows: Q(W) = E X XW 2 2 = (I Wtrue) 1(I W) 2 2 The motivation to choose such score is that it was shown by Loh & Bühlmann (2014) that Wtrue is the unique global minimizer of the population LS for linear SEMs with equal noise variances. Next, we provide a closer look as to why our algorithm is capable of learning the correct structure, while KKTS and NOTEARS fail. Optimizing NOTEARS Objectives via Topological Swaps C.1.1. THE OUTPUT OF KKTS In the KKTS algorithm of Wei et al. (2020), consider the set of edge absence constraints to be initialized at: Z0 = {(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), (3, 3)}. That is, the algorithm is initialized at the empty graph. Then, we have W (Z0) = 0, Q(W (Z0)) = 2 1 a ab a a2 + 1 b + a2b ab b + a2b (ab)2 + b2 + 1 and {(i, j) | [ h(|W (Z0)|)]ij = 0} = {(1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2)}. Recall that the KKTS algorithm will remove the pair (i, j) from Z0 that satisfies the following property: (i, j) = arg max {(i,j)|[ h(|W (Z0)|)]ij=0} Z0 |[ Q(W (Z0))]ij|. 
C.1.1. THE OUTPUT OF KKTS

In the KKTS algorithm of Wei et al. (2020), consider the set of edge-absence constraints to be initialized at Z_0 = {(1,1), (1,2), (1,3), (2,1), (2,2), (2,3), (3,1), (3,2), (3,3)}; that is, the algorithm is initialized at the empty graph. Then we have W*(Z_0) = 0,

\nabla Q(W^*(Z_0)) = -2 \begin{pmatrix} 1 & a & ab \\ a & a^2+1 & b + a^2 b \\ ab & b + a^2 b & (ab)^2 + b^2 + 1 \end{pmatrix},

and {(i, j) | [∇h(|W*(Z_0)|)]_ij = 0} = {(1,2), (1,3), (2,1), (2,3), (3,1), (3,2)}. Recall that the KKTS algorithm removes from Z_0 the pair (i, j) that satisfies

(i, j) = arg max_{(i,j) ∈ Z_0 : [∇h(|W*(Z_0)|)]_ij = 0} |[∇Q(W*(Z_0))]_ij|.

Now, consider the case max{|a|, |ab|} < |(a²+1)b|. Then the pair (3, 2) is removed from Z_0 and the resulting set of edge-absence constraints is Z_1 = {(1,1), (1,2), (1,3), (2,1), (2,2), (2,3), (3,1), (3,3)}. At the next step we have

W^*(Z_1) = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & \frac{a^2 b + b}{(ab)^2 + b^2 + 1} & 0 \end{pmatrix}, \quad
\nabla Q(W^*(Z_1)) = -2 \begin{pmatrix} 1 & \frac{a}{(ab)^2 + b^2 + 1} & ab \\ a & \frac{a^2 + 1}{(ab)^2 + b^2 + 1} & a^2 b + b \\ ab & 0 & (ab)^2 + b^2 + 1 \end{pmatrix},

and {(i, j) | [∇h(|W*(Z_1)|)]_ij = 0} = {(1,2), (1,3), (2,1), (3,1), (3,2)}. Consider now the case max{ |a|/((ab)²+b²+1), |ab| } < |a|. Then the pair (2, 1) is removed from Z_1 and the resulting set of edge-absence constraints is Z_2 = {(1,1), (1,2), (1,3), (2,2), (2,3), (3,1), (3,3)}. At the next step we have

W^*(Z_2) = \begin{pmatrix} 0 & 0 & 0 \\ \frac{a}{a^2 + 1} & 0 & 0 \\ 0 & \frac{a^2 b + b}{(ab)^2 + b^2 + 1} & 0 \end{pmatrix}, \quad
\nabla Q(W^*(Z_2)) = -2 \begin{pmatrix} \frac{1}{a^2 + 1} & \frac{a}{(ab)^2 + b^2 + 1} & ab \\ 0 & \frac{a^2 + 1}{(ab)^2 + b^2 + 1} & a^2 b + b \\ 0 & 0 & (ab)^2 + b^2 + 1 \end{pmatrix},

and {(i, j) | [∇h(|W*(Z_2)|)]_ij = 0} = {(2,1), (3,1), (3,2)}. We now note that {(i, j) | [∇h(|W*(Z_2)|)]_ij = 0} ∩ Z_2 = {(3,1)}. However, [∇Q(W*(Z_2))]_31 = 0; that is, even if we remove the pair (3, 1) from Z_2, the corresponding Z_3 = {(1,1), (1,2), (1,3), (2,2), (2,3), (3,3)} leads to W*(Z_3) = W*(Z_2) and ∇Q(W*(Z_3)) = ∇Q(W*(Z_2)). Combining all the conditions on a and b, we conclude that as long as

|ab| < |a| < |(a²+1)b|

is satisfied, KKTS will find a DAG with incorrect structure, in fact, a DAG in which all edges are reversed. Finally, one can easily see that there are infinitely many a and b satisfying the above condition, for example a = 1 and b = 0.55. We emphasize that our result is also consistent with the output of the Python program provided by Wei et al. (2020).

C.1.2. THE OUTPUT OF NOTEARS

Since the NOTEARS implementation of Zheng et al. (2018) uses the augmented Lagrangian method and solves each inner unconstrained subproblem with the quasi-Newton L-BFGS method, it is not possible to derive an analytical solution. Instead, we directly inspect the DAG solution returned by NOTEARS for the ground-truth SEM with a = 1, b = 0.55:

W_notears = \begin{pmatrix} 0 & 1.49 \times 10^{-4} & 7.00 \times 10^{-7} \\ 0.16 & 0 & -1.55 \\ -0.22 & 1.59 \times 10^{-5} & 0 \end{pmatrix}.

Note that W_notears is not exactly a DAG. Thus, we apply a threshold to remove the small entries of W_notears. The resulting adjacency matrix is now a DAG and is denoted W_notears_thres; one can clearly see that the NOTEARS solution does not recover the true structure:

W_notears_thres = \begin{pmatrix} 0 & 0 & 0 \\ 0.16 & 0 & -1.55 \\ -0.22 & 0 & 0 \end{pmatrix}.

Let us now check the KKT conditions given in Lemma 1. We have {(i, j) | [∇h(|W_notears_thres|)]_ij = 0} = {(3,1), (2,3), (2,1)}, while [∇Q(W_notears_thres)]_21 ≠ 0, [∇Q(W_notears_thres)]_23 ≠ 0, and [∇Q(W_notears_thres)]_31 ≠ 0. We can then observe that NOTEARS fails both at outputting the true DAG structure and at outputting a DAG that is a local minimum.

C.1.3. THE OUTPUT OF TOPO (ALGORITHM 2)

To study how TOPO works on this 3-node example, we first calculate the loss of all possible topological sorts:

Topological sort π : Score
(1, 2, 3) : 3
(1, 3, 2) : 2 + b² + 1/(1+b²)
(2, 1, 3) : 2 + a² + 1/(a²+1)
(3, 1, 2) : 1 + b² + (ab)² + 1/(1+b²) + (1+b²)/(1+b²+(ab)²)
(2, 3, 1) : 2 + a² + 1/(a²+1)
(3, 2, 1) : 1/(1+a²) + (1+a²)/(1+(ab)²+b²) + 1 + b² + (ab)²

Rewriting eq. (8) for the case of linear models, we have W*_π ∈ arg min_W { Q(W) : W is consistent with π }. To understand why TOPO is capable of returning the correct structure, we next define adjacent topological sorts (the scores above, and the comparisons in Table 6 below, can also be checked numerically; see the sketch that follows).
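The score column of the table above can be reproduced by fitting, for each permutation, the best W consistent with that order under the population LS score. A minimal sketch, using 0-based node indexes (our own code, not the released implementation):

```python
import numpy as np
from itertools import permutations

a, b = 1.0, 0.55
W_true = np.array([[0.0, a, 0.0], [0.0, 0.0, b], [0.0, 0.0, 0.0]])
I3 = np.eye(3)
M = np.linalg.inv(I3 - W_true)
Sigma = M.T @ M                                    # population covariance, as before

def score_of_order(order):
    """Best population LS score over all W consistent with `order` (0-based node indexes)."""
    W = np.zeros((3, 3))
    for q, j in enumerate(order):
        parents = list(order[:q])                  # nodes allowed to point into j
        if parents:
            # restricted population regression of X_j on the nodes preceding it
            W[parents, j] = np.linalg.solve(Sigma[np.ix_(parents, parents)],
                                            Sigma[parents, j])
    R = I3 - W
    return np.trace(R.T @ Sigma @ R)

for order in permutations(range(3)):
    print(order, round(score_of_order(order), 4))
# The order (0, 1, 2), i.e. pi_true = [1, 2, 3] in the paper's 1-based labels, attains
# the minimum score 3; every other order matches the corresponding entry of the table.
```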
Definition 4. For two topological sorts π_1 and π_2, we say that π_1 and π_2 are adjacent if there exists a pair of nodes in π_1 such that swapping them yields π_2.

Recall that π_true = (1, 2, 3). The following statement explains precisely the success of Algorithm 2.

Corollary 3. Assume that a ≠ 0, b ≠ 0, and a² > b². Then for any topological sort π ≠ π_true, we have Q(W*_π) > Q(W*_{π_true}) = Q(W_true) = 3. Moreover, there always exists an adjacent topological sort π_adj such that Q(W*_π) > Q(W*_{π_adj}).

In other words, for any initial topological sort π, TOPO (Algorithm 2) eventually returns π_true and W_true. All cases are summarized in Table 6.

Current order π | Adjacent order π_adj | Current loss | Better loss
(1, 2, 3) | (1, 2, 3) | 3 | 3
(1, 3, 2) | (1, 2, 3) | 2 + b² + 1/(1+b²) | 3
(2, 1, 3) | (1, 2, 3) | 2 + a² + 1/(a²+1) | 3
(3, 1, 2) | (1, 3, 2) | 1 + b² + (ab)² + 1/(1+b²) + (1+b²)/(1+b²+(ab)²) | 2 + b² + 1/(1+b²)
(2, 3, 1) | (1, 3, 2) | 2 + a² + 1/(a²+1) | 2 + b² + 1/(1+b²)
(3, 2, 1) | (1, 2, 3) | 1/(1+a²) + (1+a²)/(1+(ab)²+b²) + 1 + b² + (ab)² | 3

Table 6. There always exists an adjacent topological sort whose score is strictly less than that of the current topological sort.

C.1.4. ANALYSIS

In this section we provide more intuition as to why KKTS and NOTEARS fail in the above example. Regarding the KKTS algorithm: it removes a pair (i, j) from the set of edge-absence constraints at each iteration. Loosely speaking, this is equivalent to adding an edge X_i → X_j at each iteration, which implies that node i must appear before node j in the topological sort, and this relative ordering will never be reversed in later iterations of the algorithm. Therefore, once a wrong pair (i, j) is removed from the set of edge-absence constraints, the KKTS algorithm has no way to correct the erroneous step. Although KKTS ensures that a local minimum is returned, it can learn a completely erroneous DAG structure, as shown in our example above. Regarding NOTEARS, as indicated by Wei et al. (2020), the algorithm is not guaranteed to return a local minimum even under the right formulation, and in most cases the NOTEARS solution is neither the correct structure nor a local minimum. In contrast to KKTS and NOTEARS, TOPO recovers the correct structure in the example above regardless of the initial topological sort. Our swapping strategy allows TOPO to change the topological sort in each iteration; importantly, while TOPO checks the optimality conditions, it uses the score function as the only criterion for selecting another topological ordering with a better score, thus jumping from one local optimum to a better local optimum.

C.2. Detailed Discussion About Irreducibility

In this section, we discuss the differences between our method and the KKTS algorithm (Wei et al., 2020). One obvious difference is the type of constraint used. Another important difference is that the KKTS algorithm relies on an assumption the authors call irreducibility, which we show by example is not needed in general. Here we consider a special case of (2), namely linear SEMs, as studied in previous work (Zheng et al., 2018; Wei et al., 2020):

X_j = w_j^⊤ X + z_j,  w_j ∈ R^d.   (12)

Let W = [w_1, ..., w_d]. In this case, θ is equivalent to W and θ_ij is equivalent to W_ij. Therefore, to be consistent with previous work (Zheng et al., 2018; Wei et al., 2020), we use W in place of θ.
To connect the KKT conditions and local minimality, Wei et al. (2020) used a related problem with explicit edge-absence constraints, which correspond to zero-value constraints on the matrix W. Specifically, given a set Z ⊆ V × V, their explicit edge-absence constrained problem is

min_W Q(W; X) subject to W_ij = 0, (i, j) ∈ Z.   (13)

Following the notation in Wei et al. (2020), we denote its optimal solution by W*(Z). As with (7), this problem can be solved efficiently; in fact, (7) is just the special case of (13) with Z = Z_π, where Z_π := {(π(j), π(i)) | i < j}.

Driven by Theorem 1, the KKT-informed local search (KKTS) algorithm in Wei et al. (2020) repeatedly solves the edge-absence problem (13) for different Z. The KKTS algorithm stops once an irreducible set Z is found, and the output W*(Z) is then guaranteed to be a local minimum of problem (4). Wei et al. (2020) define irreducibility of a set Z as follows:

Definition 5 (Irreducibility, Wei et al., 2020). A set Z is called irreducible if (i, j) ∈ Z implies (∇h(|W*(Z)|))_ij > 0.

Although irreducibility of Z is a sufficient condition for a feasible solution to be a KKT point (Theorem 8, Wei et al., 2020), it is not necessary, as the following example shows. Fix indices i_0 < j_0 and define a ground-truth DAG W by

W_ij = 1 if i = i_0 and j = j_0, and W_ij = 0 otherwise.

Let Ω = [d] × [d] denote the set of all index pairs, and initialize Z_0 = Ω \ {(i_0, j_0)}. Recall that the KKTS algorithm repeatedly removes elements from Z until it is irreducible. Assume that KKTS takes m steps, that the elements removed, in order, are (i_1, j_1), (i_2, j_2), ..., (i_m, j_m), and define Z_k = Ω \ {(i_0, j_0), (i_1, j_1), ..., (i_k, j_k)}.

Lemma 4. Assume Q is separable and that W is the unique global minimum of (4). Then, initializing KKTS with Z_0 = Ω \ {(i_0, j_0)}, we have the following:
1. W*(Z_k) = W for each k = 0, 1, ..., m;
2. Z_0, ..., Z_{m-1} are not irreducible;
3. Z_m = {(j_0, i_0)} ∪ {(i, i) | i = 1, ..., d} is irreducible.

In other words, the global minimum W is a KKT point that is not always irreducible, although it can be written in terms of an irreducible Z. It is easy to construct models (12) and score functions Q such that W is a global minimizer: simply choose the population least-squares score with z_j ∼ N(0, 1) for each j; see Loh & Bühlmann (2014) for details.

Corollary 4. Irreducibility of Z is sufficient but not necessary for KKTS to find a KKT point of problem (5).

Lemma 4 has direct implications when the underlying DAG is sparse. If the initial Z_0 = Ω, KKTS needs to remove most of the elements of Z_0 to reach an irreducible Z; thus it can be computationally intense and inefficient. Moreover, the score function in KKTS has a sparsity regularizer, and it can return the wrong W*(Z) even if the current Z characterizes the edge absences of the ground truth exactly.

D. Proofs of Technical Results

In this section, we present the proofs of the lemmas and theorems in detail. First, let us elaborate on how problem (8) is solved inside Algorithm 2, which will be helpful for the proofs. In problem (8), we can eliminate the constraints θ_{π(i),π(j)} = 0 for i > j by plugging them into the objective function, which makes it equivalent to an unconstrained optimization problem. Note that we can write Θ = (θ_{π(i),π(j)}, θ_{π(m),π(n)}, θ̃), where i > j and n > m. In this case, Θ_π = (0, θ*_{π(m),π(n)}, θ̃*) with

({θ*_{π(m),π(n)}}_{n>m}, θ̃*) = arg min Q(Θ) = arg min Q((0, θ_{π(m),π(n)}, θ̃)).

Therefore, we can use any off-the-shelf optimizer that solves such unconstrained problems to a stationary point, e.g., gradient descent or Adam (Kingma & Ba, 2014). Throughout the proofs, we repeatedly use the facts that ∂Q(Θ_π)/∂θ̃ = 0 and ∂Q(Θ_π)/∂θ_{π(m),π(n)} = 0 for all n > m and any π. Finally, we construct θ⁺ and θ⁻ from θ via θ⁺ = max{θ, 0} and θ⁻ = max{−θ, 0}.
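To illustrate the elimination argument above, the following sketch (ours, not the released solver) minimizes the linear least-squares score over the entries permitted by a given order, using plain gradient descent on the free coordinates; any stationary-point method such as Adam or L-BFGS could be substituted.

```python
import numpy as np

def fit_order_masked_gd(X, order, lr=0.01, n_iter=2000):
    """Gradient-descent sketch of the order-constrained problem for a linear SEM:
    minimize (1/2n)||X - XW||_F^2 over W, with W_ij fixed to 0 whenever the order
    forbids the edge i -> j (i.e. j does not come after i in `order`)."""
    n, d = X.shape
    mask = np.zeros((d, d), dtype=bool)
    for q, j in enumerate(order):
        mask[list(order[:q]), j] = True        # node j may only receive edges from earlier nodes
    W = np.zeros((d, d))
    for _ in range(n_iter):
        grad = -X.T @ (X - X @ W) / n          # gradient of the least-squares score
        W -= lr * (grad * mask)                # constrained coordinates never leave zero
    return W
```

With a small enough step size this converges to the minimizer of (8) for the convex least-squares score; in the linear case the same solution is also available in closed form by regressing each variable on the nodes preceding it in the order.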
D.1. The Extension of Theorem 1

The proof of Theorem 1 in Wei et al. (2020) covers the case where the adjacency matrix W does not have any parametrization. For completeness and ease of reference, we state the generalization to general parametrizations here:

Theorem 4 (Theorem 1 in our context). Assume that Q(Θ) is convex. If (θ⁺, θ⁻, θ̃) satisfies the KKT conditions in (6), then (θ, θ̃) is a local minimum of problem (4), where θ = θ⁺ − θ⁻.

Although the proof is similar, we include a proof for completeness in Appendix D.10.

D.2. Proof of Lemma 1

Proof. Denote θ_ij = (θ_ijr)_r; note that θ_ij is a vector whose components are θ_ijr, r = 1, 2, .... Therefore, ∂Q(Θ)/∂θ⁺_ij = ∂Q(Θ)/∂θ_ij = −∂Q(Θ)/∂θ⁻_ij. Let us simplify the term ∂h(W(θ⁺ + θ⁻))/∂θ^±_ij. First, note that [W(θ⁺ + θ⁻)]_ij = ‖θ⁺_ij + θ⁻_ij‖₁ = 1^⊤(θ⁺_ij + θ⁻_ij) = Σ_r (θ⁺_ijr + θ⁻_ijr), where we use the fact that θ⁺_ij ≥ 0 and θ⁻_ij ≥ 0. Since h(W(θ⁺ + θ⁻)) depends on θ^±_ij only through [W(θ⁺ + θ⁻)]_ij, the chain rule gives

∂h(W(θ⁺ + θ⁻))/∂θ^±_ij = [∇h(W(θ⁺ + θ⁻))]_ij · 1 = [∇h(W(|θ|))]_ij · 1.

First, consider any (i, j) such that [∇h(W(|θ|))]_ij = [∇h(W(θ⁺ + θ⁻))]_ij > 0. Since

λ > max_{(i,j): [∇h(W(|θ|))]_ij > 0} ‖∂Q(Θ)/∂θ_ij‖₁ / [∇h(W(|θ|))]_ij,

conditions (6a) and (6b) are satisfied with M⁺_ij > 0 and M⁻_ij > 0. From condition (i) we have θ_ij = 0, that is, θ^±_ij = 0; thus (6c) is satisfied since θ⁺_ij ⊙ M⁺_ij = θ⁻_ij ⊙ M⁻_ij = 0.

Second, consider any (i, j) such that [∇h(W(|θ|))]_ij = [∇h(W(θ⁺ + θ⁻))]_ij = 0, so that ∂h(W(θ⁺ + θ⁻))/∂θ^±_ij = 0. In this case (6a) and (6b) reduce to ∂Q(Θ)/∂θ⁺_ij = M⁺_ij ≥ 0 and ∂Q(Θ)/∂θ⁻_ij = M⁻_ij ≥ 0, and we also know that ∂Q(Θ)/∂θ⁺_ij = −∂Q(Θ)/∂θ⁻_ij. From condition (ii), ∂Q(Θ)/∂θ_ij = 0, so we may set the corresponding M⁺_ij = M⁻_ij = 0; then (6a) and (6b) are satisfied. We also have θ⁺_ij ⊙ M⁺_ij = θ⁻_ij ⊙ M⁻_ij = 0, hence (6c) is satisfied.

From (iii), (6d) is satisfied. From (iv), we know θ⁺, θ⁻ ≥ 0. It is also clear that for every (i, j) we have θ_ij · [∇h(W(θ⁺ + θ⁻))]_ij = 0, which is equivalent to h(W(θ⁺ + θ⁻)) = 0 (Wei et al., 2020, Lemma 4). The feasibility conditions in (5) are therefore also satisfied. Thus, (θ⁺, θ⁻) satisfies the KKT conditions in (6). Finally, by Theorem 4, (θ⁺ − θ⁻, θ̃) is a local minimum of problem (4) if Q(Θ) is convex.

D.3. Proof of Lemma 2

Proof. Assume p < q. Node π(p) comes before π(q) in π, so by the definition of a topological sort there is no directed walk from π(q) to π(p), which implies (∇h(W(|θ_π|)))_{π(p),π(q)} = 0 (Wei et al., 2020, Lemma 7) and (W(|θ_π|))_{π(q),π(p)} = 0. By the optimality conditions of (8), ∂Q(Θ_π)/∂θ_{π(p),π(q)} = 0. In other words, every possible element of Y(Θ_π, 0, 0) must be of the form (π(q), π(p)) with p < q. Therefore, (θ_π)_{π(q),π(p)} = 0, and by definition [W(|θ_π|)]_{π(q),π(p)} = ‖(θ_π)_{π(q),π(p)}‖₁ = 0.
D.4. Proof of Lemma 3

Proof. Let (i, j) ∈ Y(Θ_π, 0, 0) be arbitrary. Then (∇h(W(|θ_π|)))_ij = 0 and ∂Q(Θ_π)/∂θ_ij ≠ 0, which indicates that there is no directed walk from j to i. From Lemma 2, (θ_π)_ij = 0. Changing the value of (θ_π)_ij introduces a new edge, which could in principle create a cycle. However, by Lemma 6 in Wei et al. (2020), changing the value of (θ_π)_ij cannot create a directed walk from j to i, and, by the separability assumption on Q(Θ) and the same argument as in the proof of Lemma 8 in Wei et al. (2020), changing the value of (θ_π)_ij does not create a cycle. Therefore, (θ_π)_ijr can be increased or decreased (according to whether ∂Q(θ_π)/∂θ_ijr < 0 or ∂Q(θ_π)/∂θ_ijr > 0) so as to reduce the loss while remaining feasible, which implies that W(|θ'|) in Algorithm 4 is still a DAG and Q(θ') < Q(θ_π). Since W(|θ'|) follows the topological sort π_ij, we obtain Q(Θ_{π_ij}) ≤ Q(θ') < Q(θ_π).

D.5. Proof of Lemma 4

Proof. For Z_0 = Ω \ {(i_0, j_0)}, W*(Z_0) is obviously a DAG. Since W is the global minimum of problem (4), Q(W) ≤ Q(W*(Z_0)). On the other hand, W is a feasible solution of problem (13) with Z_0, so Q(W*(Z_0)) ≤ Q(W). Since W is unique by assumption, W = W*(Z_0). For Z_1 = Ω \ {(i_0, j_0), (i_1, j_1)} we can use the same argument. KKTS continues until a set Z_l = Ω \ {(i_0, j_0), (i_1, j_1), ..., (i_l, j_l)} is reached for which the solution W*(Z_l) is no longer automatically guaranteed to be a DAG. For example, if

Z_{l-1} = Ω \ {(i, j) | i < j, i = 1, ..., d} and Z_l = Z_{l-1} \ {(i_m, j_m)},

then the only requirement on (i_m, j_m) is i_m > j_m. By the same argument as before, W*(Z_{l-1}) = W. Using Lemma 8 from Wei et al. (2020), W*(Z_l) is also a DAG, hence Q(W) ≤ Q(W*(Z_l)). Moreover, W is a feasible solution of problem (13) with Z_l, thus Q(W*(Z_l)) ≤ Q(W). Since W is unique by assumption, W = W*(Z_l). By the same arguments, KKTS continues until the irreducible set Z_m = {(j_0, i_0)} ∪ {(i, i) | i = 1, ..., d} is returned.

D.6. Proof of Corollary 2

Proof. Because Y(Θ_π, 0, 0) = ∅, we know that for any (i, j) such that [∇h(W(|θ_π|))]_ij = 0 we have ∂Q(Θ_π)/∂θ_ij = 0, so condition (ii) in Lemma 1 is satisfied. Therefore, we only need to prove that for any (i, j) such that [∇h(W(|θ_π|))]_ij > 0 we have (θ_π)_ij = 0, i.e., [W(|θ_π|)]_ij = 0. Indeed, [∇h(W(|θ_π|))]_ij > 0 implies that there exists a directed walk from j to i, which means that node j appears before node i in the topological sort, so θ_ij = 0. Thus, condition (i) in Lemma 1 is also satisfied. The explanation given at the start of Section D fulfills condition (iii), and condition (iv) is satisfied naturally by our construction. Therefore, Θ_π is a KKT point by Lemma 1.

D.7. Proof of Corollary 4

Proof. Follows from Lemma 4.

D.8. Proof of Theorem 2

Proof. For any p < q, [∇h(W(|θ_π|))]_{π(q),π(p)} > 0 by the definition of a connected estimator. Because π(p) appears before π(q) in the topological sort, [W(|θ_π|)]_{π(q),π(p)} = 0, i.e., (θ_π)_{π(q),π(p)} = 0. Hence all pairs (π(q), π(p)) with p < q satisfy condition (i) of Lemma 1. By the same argument as in the proof of Corollary 2, all pairs (π(p), π(q)) with p < q satisfy condition (ii) of Lemma 1. Condition (iii) is satisfied by the reasoning presented at the beginning of Section D, and condition (iv) is satisfied naturally by our construction. Therefore, Θ_π is a KKT point and, by Theorem 1, it is also a local minimum if Q is convex. Under the connected-estimator assumption, the solution at each iteration is thus a local minimum whenever Q is convex.

D.9. Proof of Theorem 3

Proof. If Y(Θ_π, 0, 0) ≠ ∅, we can always construct a new topological sort π_ij by Lemma 3 that strictly decreases the score function. Otherwise, the algorithm searches in the space Y(Θ_π, τ, ξ) or Y(Θ_π, τ̄, ξ̄) to find a better topological sort until it cannot. Note that at the last iteration it must be that Y(Θ_π, 0, 0) = ∅, so the final θ_π is a KKT point, i.e., a local minimum if Q is convex, by Theorem 4.
D.10. Proof of Theorem 4

Before we jump into the proof, let us first consider the problem

min_Θ Q(Θ) subject to θ_ij = 0, (i, j) ∈ Z.   (14)

Recall the definition θ̃ = Θ \ θ. The necessary optimality conditions for (14) are

∂Q(Θ)/∂θ_ij = 0, (i, j) ∉ Z,   (15a)
θ_ij = 0, (i, j) ∈ Z,   (15b)
∂Q(Θ)/∂θ̃ = 0.   (15c)

Given a KKT point (θ⁺, θ⁻, θ̃) in (6), we can define the set

P := {(i, j) : [∇h(W(θ⁺ + θ⁻))]_ij > 0}.   (16)

Although the set P does not appear in Theorem 4 explicitly, it appears in Lemma 5, which is key to proving Theorem 4.

Lemma 5. If (θ⁺, θ⁻, θ̃) satisfies the KKT conditions in (6), then Θ = (θ, θ̃) satisfies the optimality conditions in (15) for Z = P defined in (16), where θ = θ⁺ − θ⁻. If, in addition, Q(Θ) is convex, then Θ is a minimizer of (14) for Z = P.

Proof of Theorem 4. Let θ be a feasible solution of (4) (i.e., W(|θ|) is a DAG) with ‖θ − θ*‖_F < ϵ (the Frobenius norm is used for concreteness), where θ* = θ⁺ − θ⁻. Since h(W(|θ|)) is a continuous function of θ, there exists a sufficiently small ϵ > 0 such that [∇h(W(|θ|))]_ij > 0 whenever [∇h(W(|θ*|))]_ij > 0, in other words, for (i, j) in the set P. Then, for feasible θ within such an ϵ-ball around θ*, it follows from the same argument as in the proof of Lemma 5 that θ_ij = 0 for (i, j) ∈ P. Hence θ is a feasible solution of (14) for Z = P. By Lemma 5 and the convexity of Q, we then have Q(Θ*) ≤ Q(Θ) for all feasible θ such that ‖θ − θ*‖_F < ϵ.

D.11. Proof of Lemma 5

Proof. For (i, j) ∉ Z = P, we have [∇h(W(|θ|))]_ij = 0. Because ∂h(W(θ⁺ + θ⁻))/∂θ^±_ij = [∇h(W(|θ|))]_ij · 1, it follows that ∂h(W(θ⁺ + θ⁻))/∂θ^±_ij = 0. From (6a) and (6b), ∂Q(Θ)/∂θ⁺_ij = M⁺_ij ≥ 0 and ∂Q(Θ)/∂θ⁻_ij = M⁻_ij ≥ 0. It is also known that ∂Q(Θ)/∂θ⁺_ij = −∂Q(Θ)/∂θ⁻_ij, so ∂Q(Θ)/∂θ^±_ij = 0, which is equivalent to ∂Q(Θ)/∂θ_ij = 0. This means that (15a) is satisfied.

For (i, j) ∈ Z = P, we have [∇h(W(|θ|))]_ij > 0. Since (θ⁺, θ⁻) is a feasible solution, W(|θ|) is a DAG. Moreover, [∇h(W(|θ|))]_ij > 0 indicates that there is a directed path from node j to node i, which implies that there is no edge from node i to node j. Hence [W(|θ|)]_ij = ‖θ_ij‖₁ = ‖θ⁺_ij‖₁ + ‖θ⁻_ij‖₁ = 0, i.e., θ⁺_ij = θ⁻_ij = 0, and we conclude that θ_ij = θ⁺_ij − θ⁻_ij = 0. Thus (15b) is satisfied. From (6d), it is obvious that (15c) is satisfied. Therefore, (θ, θ̃) satisfies the optimality conditions in (15) for Z = P, where θ = θ⁺ − θ⁻. If Q(Θ) is a convex function, the conditions in (15) are also sufficient for optimality in (14).

E. Detailed Experiments

E.1. Experimental Setting

Here we describe the details of how graphs and data are generated for linear SEMs with different noise distributions, fully connected graphs, logistic models, and nonlinear models with neural networks. For each model, a random graph G was generated from one of two random-graph models, Erdős-Rényi (ER) or scale-free (SF), with kd edges on average (k ∈ {1, 2, 4}), denoted ERk or SFk.

Erdős-Rényi (ER). Random graphs whose edges are added independently with equal probability. We simulated models with d, 2d, and 4d edges (in expectation), denoted ER1, ER2, and ER4, respectively.

Scale-free (SF). Networks simulated according to the preferential attachment process (Barabási & Albert, 1999). We simulated scale-free networks with d, 2d, and 4d edges and β = 1, where β is the exponent used in the preferential attachment process.
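For concreteness, the following is a small sketch (ours, not the released experiment code) of how such data can be generated for the linear, equal-variance Gaussian case described next; the helper names simulate_er_dag and simulate_linear_sem are our own, and the ER sampler simply orients an undirected ER graph by a random ordering.

```python
import numpy as np

def simulate_er_dag(d, k, rng):
    """Random DAG with roughly k*d expected edges: orient an ER graph by a random order."""
    order = rng.permutation(d)
    p = min(1.0, k * d / (d * (d - 1) / 2))      # edge probability giving ~k*d edges
    B = np.zeros((d, d))
    for a in range(d):
        for b_ in range(a + 1, d):
            if rng.random() < p:
                B[order[a], order[b_]] = 1.0     # edge from earlier to later in the order
    return B

def simulate_linear_sem(B, n, rng):
    """Weights ~ Unif([-2,-0.5] U [0.5,2]); Gauss-EV noise; X solves X = X W + Z row-wise."""
    d = B.shape[0]
    signs = rng.choice([-1.0, 1.0], size=(d, d))
    W = B * signs * rng.uniform(0.5, 2.0, size=(d, d))
    Z = rng.standard_normal((n, d))              # replace with Exp/Gumbel draws for other models
    X = Z @ np.linalg.inv(np.eye(d) - W)
    return W, X

rng = np.random.default_rng(0)
B = simulate_er_dag(d=10, k=1, rng=rng)
W, X = simulate_linear_sem(B, n=1000, rng=rng)
```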
Linear SEMs. Given a random DAG B ∈ {0, 1}^{d×d} from one of these two graph models, we assigned edge weights independently from Unif([−2, −0.5] ∪ [0.5, 2]) to obtain a weight matrix W ∈ R^{d×d}. Given W, we sampled X = W^⊤X + z ∈ R^d according to the following noise models:

Gaussian noise with equal variance (Gauss-EV): z ∼ N(0, I_{d×d}).
Gaussian noise with unequal variance (Gauss-NV): z_i ∼ N(0, σ_i²), i = 1, ..., d, where σ_i ∼ Unif[1, 2].
Exponential noise (Exp): z_j ∼ Exp(1), j = 1, ..., d.
Gumbel noise (Gumbel): z_j ∼ Gumbel(0, 1), j = 1, ..., d.

Based on these models, we generated random datasets X ∈ R^{n×d} by generating the rows i.i.d. according to one of the models above. For each simulation, we generated n = 1000 samples for graphs with d ∈ {10, 20, 50, 100} nodes. For each dataset, we ran FGS, PC, NOTEARS, KKTS with NOTEARS as initialization, TOPO with random initialization, TOPO with NOTEARS as initialization, GOLEM-EV (equal variance), and GOLEM-NV (unequal variance). Here, random initialization means that a topological sort π is sampled uniformly at random and (7) is then solved to obtain θ_π as the initialization. Finally, a post-processing threshold of ω = 0.3 is applied to W, following Zheng et al. (2018). Since FGS outputs a CPDAG instead of a DAG or a weight matrix, we orient the undirected edges favorably when making comparisons.

In the linear model with unequal-variance Gaussian noise, the minimax concave penalty (MCP) is used to approximate the ℓ0 penalty, namely λ|w| − w²/(2β) if |w| ≤ βλ and βλ²/2 otherwise, with λ = 0.005 and β = 10.

For TOPO, we use the least-squares loss Q(W; X) = (1/2n)‖X − XW‖²_F without any regularization for all noise types. We also use the polynomial acyclicity penalty h(A) = Tr((I + A/d)^d) − d (Yu et al., 2019) and the log-determinant penalty h(A) = −log det(I − A) (Bello et al., 2022), because they are faster and more accurate than h(A) = Tr(e^A) − d (Zheng et al., 2018). For the choices of s_small, s_large, s0, Table 7 summarizes the suggested hyperparameters. The basic idea is to increase s_small, s_large, and s0 as d grows or the graph gets denser.

Table 7. Suggested hyperparameters for s_small, s_large, s0.
# nodes | s_small | s_large | s0
d = 10  | 30      | 45      | 1
d = 20  | 50      | 150     | 1
d = 50  | 100     | 1000    | 10
d = 100 | 150     | 2500    | 15

Logistic models. Given G, we assigned edge weights independently from Unif([−2, −0.5] ∪ [0.5, 2]) to obtain a weight matrix W ∈ R^{d×d}. Given W, we sample X_j according to

X_j ∼ Bernoulli( exp(w_j^⊤X) / (1 + exp(w_j^⊤X)) ), j = 1, ..., d.

Based on this model, we generated random datasets X ∈ R^{n×d} by generating the rows i.i.d.; for each simulation, we generated n = 10000 samples for graphs with d ∈ {10, 20, 30, 40, 50} nodes. For each dataset, we ran FGS, PC, NOTEARS, TOPO with random initialization, and TOPO with NOTEARS as initialization. We use the penalized log-likelihood as the score function, i.e.,

Q(f; X) = (1/n) Σ_{i=1}^{d} ( 1_n^⊤ log(1_n + exp(f_i(X))) − x_i^⊤ f_i(X) ) + λ‖W‖₁,

where λ = 0.01.

Fully connected graphs. We randomly generate a topological sort π and generate a fully connected graph that is consistent with π. The other settings are the same as for the linear SEMs. Because this is a genuinely hard problem, we increase s_small, s_large, and s0 relative to Table 7.

Nonlinear models with neural networks. We mainly follow the nonlinear setting of Zheng et al. (2020). Given G, we simulate the SEM

X_j = f_j(X_{pa(j)}) + z_j, j ∈ [d],

where z_j ∼ N(0, 1) and f_j is a randomly initialized MLP as described in Section 3.3. For TOPO, the score function is

Q(f; X) = (1/n) Σ_{i=1}^{d} ‖x_i − f̂_i(X)‖²₂,

where each f̂_i is chosen as an MLP with one hidden layer of size 30 and sigmoid activation.
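To make the nonlinear score concrete, here is a minimal numpy sketch (ours; the experiments themselves use the NOTEARS-MLP codebase) of evaluating this score with one randomly initialized one-hidden-layer MLP of width 30 per node. The masking of non-parent inputs that the full method relies on is omitted for brevity.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def init_mlps(d, hidden=30, rng=None):
    """One MLP per node j: R^d -> R, a single hidden layer of width `hidden` with sigmoid."""
    rng = rng or np.random.default_rng(0)
    return [(rng.standard_normal((d, hidden)) * 0.1,        # W1
             rng.standard_normal(hidden) * 0.1,             # b1
             rng.standard_normal(hidden) * 0.1,             # w2
             0.0)                                           # b2
            for _ in range(d)]

def nonlinear_score(X, mlps):
    """Q(f, X) = (1/n) * sum_j || X[:, j] - f_j(X) ||_2^2 with f_j an MLP over all columns."""
    n, d = X.shape
    total = 0.0
    for j, (W1, b1, w2, b2) in enumerate(mlps):
        pred = sigmoid(X @ W1 + b1) @ w2 + b2
        total += np.sum((X[:, j] - pred) ** 2)
    return total / n

X = np.random.default_rng(1).standard_normal((100, 5))
print(nonlinear_score(X, init_mlps(d=5)))
```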
Implementation. The implementation details of the baselines are listed below:

FGS and PC are standard baselines for structure learning. The implementation is based on the py-causal package, available at https://github.com/bd2kccd/py-causal. For the PC algorithm we use the Fisher Z test; for GES we use the cg-bic score and maxDegree = 50.

NOTEARS (and NOTEARS_MLP) was run using the Python code at https://github.com/xunzheng/notears. Its score function is the least-squares loss with ℓ1 regularization. We use the default threshold ω = 0.3.

KKTS was run using the Python code at https://github.com/skypea/DAG_No_Fear. We allow KKTS to reverse edges in each iteration to achieve its best performance.

GOLEM was run using the Python and TensorFlow code at https://github.com/ignavierng/golem, with default parameters.

In the experiments, we use the default hyperparameters for these baselines unless otherwise stated.

E.2. Metrics

We evaluate the performance of each algorithm with the following metrics:

Structural Hamming distance (SHD): a standard benchmark in the structure-learning literature that counts the total number of edge additions, deletions, and reversals needed to convert the estimated graph into the true graph. PC and GES return a CPDAG that may contain undirected edges, in which case we evaluate them favorably by assuming the correct direction for undirected edges whenever possible.

Score: the value of the least-squares score function.

KKT: whether the solution satisfies the KKT conditions (1 stands for Yes and 0 for No). Define a KKT matrix K(θ) for θ by

[K(θ)]_ij = |∂Q(Θ)/∂θ_ij| if [∇h(W(θ))]_ij = 0, and [K(θ)]_ij = |W(θ)_ij| if [∇h(W(θ))]_ij > 0,

and set KKT = 1 if max_ij [K(θ)]_ij = 0 and KKT = 0 otherwise.

Timing: how much time the algorithm takes to run; we use it to measure the speed of the algorithms.
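A numerical sketch of this check for the linear case is given below. It is our own illustration and makes two assumptions not stated above: it uses the trace-exponential characterization h(A) = Tr(e^A) − d of Zheng et al. (2018) to obtain ∇h, and it replaces exact zeros with a small tolerance; the released code may differ in both respects.

```python
import numpy as np
from scipy.linalg import expm

def kkt_metric(W, X, tol=1e-8):
    """Numerical version of the KKT flag above, for a linear least-squares fit W on X (n x d).

    Assumes h(A) = tr(exp(A)) - d evaluated at A = |W|, so that [grad h(|W|)]_ij equals
    [exp(|W|)^T]_ij, which is positive exactly when i = j or there is a directed walk j -> i.
    """
    n, d = X.shape
    grad_Q = -X.T @ (X - X @ W) / n               # gradient of (1/2n) ||X - XW||_F^2
    grad_h = expm(np.abs(W)).T
    K = np.where(grad_h <= tol, np.abs(grad_Q), np.abs(W))
    return int(K.max() <= tol), K
```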
E.3. Sensitivity of s_small, s_large, s0

In Tables 8, 9, 10, and 11, we investigate the effect of the size of the search space and of the number of searches in the larger space on Algorithm 2. We focus on two cases: (1) a simple case, ER1 graphs with Gaussian noise and d = 100; and (2) a hard case, SF4 graphs with Gaussian noise and d = 100. Within each block, columns correspond to s_small ∈ {50, 150, 200} and rows to s_large ∈ {1000, 2000, 3000}. A blank entry ("-") means that the algorithm had already stopped at that iteration, and n0 indicates how many searches in the large space have been used. Generally speaking, for sparser graphs a small search space and a small s0 are enough to return a good solution, while for denser graphs the performance of Algorithm 2 is more sensitive to the choice of s_small, s_large, and s0.

Table 8. Structural Hamming distance (SHD) for different s_small, s_large, n0 with d = 100 and n = 1000 on a Gaussian ER1 graph (entries are for s_small = 50 / 150 / 200).
s_large | n0 = 0          | n0 = 1      | n0 = 2    | n0 = 3
1000    | 136 / 136 / 136 | 20 / 11 / 0 | 8 / 6 / - | 5 / 0 / -
2000    | 136 / 136 / 136 | 19 / 11 / 0 | 8 / 6 / - | 4 / 0 / -
3000    | 136 / 136 / 136 | 19 / 11 / 0 | 8 / 6 / - | 2 / 0 / -

Table 9. Score for different s_small, s_large, n0 with d = 100 and n = 1000 on a Gaussian ER1 graph (entries are for s_small = 50 / 150 / 200).
s_large | n0 = 0                      | n0 = 1                   | n0 = 2              | n0 = 3
1000    | 113.017 / 113.017 / 113.017 | 49.570 / 48.291 / 47.215 | 47.731 / 47.874 / - | 47.451 / 47.219 / -
2000    | 113.017 / 113.017 / 113.017 | 49.281 / 48.291 / 47.215 | 48.141 / 47.874 / - | 47.438 / 47.219 / -
3000    | 113.017 / 113.017 / 113.017 | 49.281 / 48.291 / 47.215 | 48.141 / 47.874 / - | 47.369 / 47.219 / -

Table 10. Structural Hamming distance (SHD) for different s_small, s_large, n0 with d = 100 and n = 1000 on a Gaussian SF4 graph (entries are for s_small = 50 / 150 / 200).
s_large | n0 = 0          | n0 = 1          | n0 = 2          | n0 = 3
1000    | 776 / 405 / 322 | 672 / 244 / 144 | 568 / 185 / 143 | 295 / 58 / 143
2000    | 774 / 405 / 349 | 693 / 311 / 40  | 455 / 112 / 38  | 56 / 0 / 38
3000    | 779 / 405 / 366 | 574 / 119 / 144 | 272 / 118 / 71  | 44 / 0 / 50

Table 11. Loss for different s_small, s_large, n0 with d = 100 and n = 1000 on a Gaussian SF4 graph (entries are for s_small = 50 / 150 / 200).
s_large | n0 = 0                    | n0 = 1                    | n0 = 2                   | n0 = 3
1000    | 194.871 / 67.834 / 63.498 | 162.679 / 57.024 / 50.124 | 82.946 / 55.102 / 49.199 | 58.244 / 48.113 / 49.198
2000    | 189.561 / 67.834 / 62.848 | 157.953 / 61.346 / 48.028 | 83.159 / 50.351 / 47.800 | 47.905 / 47.695 / 47.799
3000    | 187.662 / 67.843 / 62.79  | 106.329 / 54.097 / 51.45  | 56.241 / 49.924 / 49.70  | 47.925 / 47.694 / 47.71

E.4. Linear Models

SHD comparisons: ER and SF graphs without FGES and PC.
Figure 1. Structural Hamming distance (SHD), lower is better, plotted against d (number of nodes). Rows: noise type of the SEM. Columns: random graph types, {SF, ER}-k = {scale-free, Erdős-Rényi} graphs with kd expected edges. Methods shown: GOLEM, NOFEARS_NOTEARS, NOTEARS, NOTEARS_TOPO, RANDOM_TOPO. Here, NOFEARS_NOTEARS is the KKTS algorithm (Wei et al., 2020) using the NOTEARS solution as initial point; our methods are RANDOM_TOPO (random initialization) and NOTEARS_TOPO (using the NOTEARS solution as initial point). Error bars represent standard errors over 10 simulations.

SHD comparisons: ER graphs.
Figure 2. Structural Hamming distance (SHD), lower is better, plotted against d (number of nodes). Rows: noise type of the SEM. Columns: {ER}-k = Erdős-Rényi graphs with kd expected edges. Methods shown: FGES, GOLEM, NOFEARS_NOTEARS, NOTEARS, NOTEARS_TOPO, PC, RANDOM_TOPO. Here, NOFEARS_NOTEARS is the KKTS algorithm (Wei et al., 2020) using the NOTEARS solution as initial point; our methods are RANDOM_TOPO (random initialization) and NOTEARS_TOPO (using the NOTEARS solution as initial point). Error bars represent standard errors over 10 simulations.

SHD comparisons: SF graphs.
Figure 3. Structural Hamming distance (SHD), lower is better, plotted against d (number of nodes). Rows: noise type of the SEM. Columns: {SF}-k = scale-free graphs with kd expected edges. Methods shown: FGES, GOLEM, NOFEARS_NOTEARS, NOTEARS, NOTEARS_TOPO, PC, RANDOM_TOPO. Here, NOFEARS_NOTEARS is the KKTS algorithm (Wei et al., 2020) using the NOTEARS solution as initial point; our methods are RANDOM_TOPO (random initialization) and NOTEARS_TOPO (using the NOTEARS solution as initial point). Error bars represent standard errors over 10 simulations.

Running time comparisons: ER graphs.
Figure 4. Runtime in seconds, plotted against d (number of nodes). Rows: noise type of the SEM. Columns: {ER}-k = Erdős-Rényi graphs with kd expected edges. Methods shown: FGES, GOLEM, NOFEARS_NOTEARS, NOTEARS, NOTEARS_TOPO, PC, RANDOM_TOPO. Here, NOFEARS_NOTEARS is the KKTS algorithm (Wei et al., 2020) using the NOTEARS solution as initial point; our methods are RANDOM_TOPO (random initialization) and NOTEARS_TOPO (using the NOTEARS solution as initial point). Error bars represent standard errors over 10 simulations.
Running time comparisons: SF graphs.
Figure 5. Runtime in seconds, plotted against d (number of nodes). Rows: noise type of the SEM. Columns: {SF}-k = scale-free graphs with kd expected edges. Methods shown: FGES, GOLEM, NOFEARS_NOTEARS, NOTEARS, NOTEARS_TOPO, PC, RANDOM_TOPO. Here, NOFEARS_NOTEARS is the KKTS algorithm (Wei et al., 2020) using the NOTEARS solution as initial point; our methods are RANDOM_TOPO (random initialization) and NOTEARS_TOPO (using the NOTEARS solution as initial point). Error bars represent standard errors over 10 simulations.

Score comparisons: ER graphs.
Figure 6. Least-squares score (lower is better), plotted against d (number of nodes). Rows: noise type of the SEM. Columns: {ER}-k = Erdős-Rényi graphs (ER1, ER2, ER4) with kd expected edges. Methods shown: NOFEARS_NOTEARS, NOTEARS, NOTEARS_TOPO, RANDOM_TOPO. Here, NOFEARS_NOTEARS is the KKTS algorithm (Wei et al., 2020) using the NOTEARS solution as initial point; our methods are RANDOM_TOPO (random initialization) and NOTEARS_TOPO (using the NOTEARS solution as initial point). Error bars represent standard errors over 10 simulations.

Score comparisons: SF graphs.
Figure 7. Least-squares score (lower is better), plotted against d (number of nodes). Rows: noise type of the SEM. Columns: {SF}-k = scale-free graphs (SF1, SF2, SF4) with kd expected edges. Methods shown: GOLEM, NOFEARS_NOTEARS, NOTEARS, NOTEARS_TOPO, RANDOM_TOPO. Here, NOFEARS_NOTEARS is the KKTS algorithm (Wei et al., 2020) using the NOTEARS solution as initial point; our methods are RANDOM_TOPO (random initialization) and NOTEARS_TOPO (using the NOTEARS solution as initial point). Error bars represent standard errors over 10 simulations.

E.5. Nonlinear Models

E.5.1. LOGISTIC MODEL

Figure 8. Structural Hamming distance (SHD) for the logistic model, plotted against d (number of nodes). Rows: random graph types, {SF, ER} = {scale-free, Erdős-Rényi} graphs. Columns: kd expected edges. Methods shown: FGES, PC, NOTEARS, and our methods RANDOM_TOPO (random initialization) and NOTEARS_TOPO (using the NOTEARS solution as initial point). Error bars represent standard errors over 10 simulations.

E.5.2. NEURAL NETWORKS

SHD comparison.
Figure 9. Structural Hamming distance (SHD) for the nonlinear model with neural networks, plotted against d (number of nodes). Rows: random graph types, {SF, ER} = {scale-free, Erdős-Rényi} graphs. Columns: kd expected edges. Methods shown: FGES, NOTEARS_MLP, NOTEARS_TOPO, PC, TRUE. Our methods are RANDOM_TOPO (random initialization) and NOTEARS_TOPO (using the NOTEARS solution as initial point). TRUE (baseline): solution to (8) with the true topological sort using a neural network. Error bars represent standard errors over 10 simulations.
Score comparison.
Figure 10. Score for the nonlinear model with neural networks, plotted against d (number of nodes). Rows: random graph types, {SF, ER} = {scale-free, Erdős-Rényi} graphs. Columns: kd expected edges. Methods shown: NOTEARS_MLP, NOTEARS_TOPO, TRUE. Our methods are RANDOM_TOPO (random initialization) and NOTEARS_TOPO (using the NOTEARS solution as initial point). TRUE (baseline): solution to (8) with the true topological sort using a neural network. Error bars represent standard errors over 10 simulations.

E.6. Comparison against randomly chosen swapping set

n    | d   | # edges | TOPO SHD | TOPO loss | Random SHD | Random loss
1000 | 20  | 80      | 0.1      | 9.85      | 32.5       | 26.85
1000 | 50  | 200     | 3        | 24.33     | 126.7      | 57.33
1000 | 100 | 400     | 13.75    | 47.45     | 286.9      | 107.95

Table 12. TOPO: the candidate swapping set Y(θ, τ, ξ) is chosen by (9). Random: the TOPO algorithm chooses the candidate swapping set Y(θ, τ, ξ) randomly. Model: linear model with Gaussian noise. Graph type: ER4 graphs. This justifies that choosing the swapping set Y(θ, τ, ξ) by (9) significantly improves the performance of the TOPO algorithm.

E.7. Accuracy vs iteration

(e) d = 100, (f) d = 100.
Figure 11. Iteration vs. SHD (left) and score (right). Model: linear model with Gaussian noise. Graph type: ER4 graphs. Black: search in the small space. Red: search in the large space. When the graph is small, searching in the small space is enough to find a good local optimum, but when the graph gets larger, searching in the large space helps to jump out of a local point and decrease the score.

E.8. Greedy Strategy

(a) Structural Hamming distance (SHD); (b) run-time (seconds).
Figure 12. Comparison between the greedy and non-greedy schemes in terms of SHD and running time. RANDOM_TOPO: TOPO starts with a random initialization and uses the swap that decreases the score the most at each iteration. RANDOM_GREEDY_TOPO: TOPO starts with a random initialization and takes a swap as soon as one is found that decreases the score. Model: linear model with Gaussian noise. Graph type: ER4 graphs. The greedy scheme significantly improves time efficiency while sacrificing only a little accuracy.

F. Broader Impacts

Bayesian networks are fundamental models that represent the probabilistic relationships describing how data are generated by a set of random variables. Our work contributes to the most fundamental questions: What is the underlying structure that generates the data, and how can one recover that structure accurately and efficiently? We propose an algorithm with theoretical guarantees to address them. The main contribution of this work is a better way of solving a non-convex, continuous, score-based structure learning formulation. The dramatic improvement in accuracy means better structure recovery and more accurate discovery of the underlying probabilistic relationships. A potential negative impact of this work is that errors in structure learning may compound into potentially more serious downstream errors.
For example, a false discovery about causality may lead a company to invest substantial money and effort in remedying an incorrectly identified cause of a problem, resulting in significant losses. How to prevent incorrect causal conclusions under this continuous framework is a crucial and exciting direction for future research.