# boosting_causal_discovery_via_adaptive_sample_reweighting__107f8e99.pdf Published as a conference paper at ICLR 2023 BOOSTING DIFFERENTIABLE CAUSAL DISCOVERY VIA ADAPTIVE SAMPLE REWEIGHTING An Zhang1,2, Fangfu Liu3, Wenchang Ma2, Zhibo Cai4, Xiang Wang 5, Tat-seng Chua1,2 1Sea-NEx T Joint Lab, 2National University of Singapore, 3Tsinghua University 4Renmin University of China, 5University of Science and Technology of China anzhang@u.nus.edu, liuff19@mails.tsinghua.edu.cn, e0724290@u.nus.edu caizhibo@ruc.edu.cn, xiangwang1223@gmail.com, dcscts@nus.edu.sg Under stringent model type and variable distribution assumptions, differentiable score-based causal discovery methods learn a directed acyclic graph (DAG) from observational data by evaluating candidate graphs over an average score function. Despite great success in low-dimensional linear systems, it has been observed that these approaches overly exploit easier-to-fit samples, thus inevitably learning spurious edges. Worse still, the common homogeneity assumption can be easily violated, due to the widespread existence of heterogeneous data in the real world, resulting in performance vulnerability when noise distributions vary. We propose a simple yet effective model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, Re Score for short, where the weights tailor quantitatively to the importance degree of each sample. Intuitively, we leverage the bilevel optimization scheme to alternately train a standard DAG learner and reweight samples that is, upweight the samples the learner fails to fit and downweight the samples that the learner easily extracts the spurious information from. Extensive experiments on both synthetic and real-world datasets are carried out to validate the effectiveness of Re Score. We observe consistent and significant boosts in structure learning performance. Furthermore, we visualize that Re Score concurrently mitigates the influence of spurious edges and generalizes to heterogeneous data. Finally, we perform the theoretical analysis to guarantee the structure identifiability and the weight adaptive properties of Re Score in linear systems. Our codes are available at https://github.com/anzhang314/Re Score. 1 INTRODUCTION Learning causal structure from purely observational data (i.e., causal discovery) is a fundamental but daunting task (Chickering et al., 2004; Shen et al., 2020). It strives to identify causal relationships between variables and encode the conditional independence as a directed acyclic graph (DAG). Differentiable score-based optimization is a crucial enabler of causal discovery (Vowels et al., 2021). Specifically, it is formulated as a continuous constraint optimization problem by minimizing the average score function and a smooth acyclicity constraint. To ensure the structure is fully or partially identifiable (see Section 2), researchers impose stringent restrictions on model parametric family (e.g., linear, additive) and common assumptions of variable distributions (e.g., data homogeneity) (Peters et al., 2014; Ng et al., 2019a). Following this scheme, recent follow-on studies (Kalainathan et al., 2018; Ng et al., 2019b; Zhu et al., 2020; Khemakhem et al., 2021; Yu et al., 2021) extend the formulation to general nonlinear problems by utilizing a variety of deep learning models. 
However, upon careful inspection, we spot and justify two unsatisfactory behaviors of the current differentiable score-based methods. First, differentiable score-based causal discovery is error-prone to learning spurious edges or reversed causal directions between variables, which derails the structure learning accuracy (He et al., 2021; Ng et al., 2022). We substantiate our claim with an illustrative example in Figure 1 (see another example in Appendix D.3.1): even the fundamental chain structure in a linear system is easily misidentified by the state-of-the-art method NOTEARS (Zheng et al., 2018). Second, despite being appealing on synthetic data, differentiable score-based methods suffer from severe performance degradation when encountering heterogeneous data (Huang et al., 2020; 2019). Considering Figure 1 again, NOTEARS is susceptible to learning redundant causal edges when the distributions of the noise variables vary.

Xiang Wang is the corresponding author, also with the Institute of Artificial Intelligence, Hefei Comprehensive National Science Center.

Figure 1: A simple example of a basic chain structure on which NOTEARS learns spurious edges, while ReScore helps to mitigate their influence.

Taking a closer look at this dominant scheme (i.e., optimizing the DAG learner via an average score function under strict assumptions), we ascribe these undesirable behaviors to its inherent limitations. First, collected datasets naturally include an overwhelming number of easy samples and a small number of informative samples that might contain crucial causation information (Shrivastava et al., 2016). Scoring the samples on average deprives the discovery process of the ability to differentiate sample importance, so easy samples dominate the learning of the DAG. As a result, prevailing score-based techniques fail to learn the true causal relationships and instead yield the easier-to-fit spurious edges. Second, noise distribution shifts are inevitable and common in real-world training, as observations are typically collected at different periods, environments, locations, and so forth (Arjovsky et al., 2019). Consequently, the strong assumption of noise homogeneity for differentiable DAG learners is easily violated on real-world data (Peters et al., 2016). A line of work (Ghassami et al., 2018; Wang et al., 2022) dedicated to heterogeneous data can address this issue, but these methods often require explicit domain annotations (i.e., an ideal partition according to the heterogeneity underlying the data) for each sample, which are prohibitively expensive and hard to obtain (Creager et al., 2021), further limiting their applicability.

To reshape the optimization scheme and resolve these limitations, we propose to adaptively reweight the samples, which de facto concurrently mitigates the influence of spurious edges and generalizes to heterogeneous data. The core idea is to discover and upweight a set of less-fitted samples that offer additional insight into depicting the causal edges, compared to the samples easily fitted via spurious edges. Focusing more on less-fitted samples enables the DAG learner to generalize effectively to heterogeneous data, especially in real-world scenarios whose samples typically come from disadvantaged domains.
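To make this failure mode concrete, the following minimal sketch generates chain-structured data whose noise distribution differs across two groups of samples. It is only an illustration in Python/NumPy: the edge coefficients, noise scales, and group proportion are assumptions made for exposition, not the exact values used in Figure 1.

```python
import numpy as np

def sample_chain(n=2000, p_disadvantaged=0.1, seed=0):
    """Sample a three-variable chain A -> B -> C where a small disadvantaged group
    has a different noise scale from the dominant group.

    The coefficients (2.0), noise scales (0.1 vs. 1.0), and group proportion are
    illustrative assumptions, not the settings of Figure 1.
    """
    rng = np.random.default_rng(seed)
    disadvantaged = rng.random(n) < p_disadvantaged
    scale = np.where(disadvantaged, 0.1, 1.0)      # per-sample noise scale
    a = rng.standard_normal(n)
    b = 2.0 * a + scale * rng.standard_normal(n)
    c = 2.0 * b + scale * rng.standard_normal(n)
    return np.stack([a, b, c], axis=1), disadvantaged

X, disadvantaged = sample_chain()
```

Fitting an average score on such data lets the dominant group drive the fit, which is precisely the setting where reweighting the minority samples matters.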
However, due to the difficulty of accessing domain annotations, distinguishing such disadvantaged but informative samples and adaptively assigning their weights are challenging. Towards this end, we present a simple yet effective model-agnostic optimization framework, coined Re Score, which automatically learns to reweight the samples and optimize the differentiable DAG learner, without any knowledge of domain annotations. Specifically, we frame the adaptive weights learning and the differentiable DAG learning as a bilevel optimization problem, where the outer-level problem is solved subject to the optimal value of the inner-level problem: In the inner loop, the DAG learner is first fixed and evaluated by the reweighted score function to quantify the reliance on easier-to-fit samples, and then the instance-wise weights are adaptively optimized to induce the DAG learner to the worst-case. In the outer loop, upon the reweighted observation data where the weights are determined by the inner loop, any differential score-based causal discovery method can be applied to optimize the DAG learner and refine the causal structure. Published as a conference paper at ICLR 2023 Benefiting from this optimization scheme, our Re Score has three desirable properties. First, it is a model-agnostic technique that can empower any differentiable score-based causal discovery method. Moreover, we theoretically reveal that the structure identifiability is inherited by Re Score from the original causal discovery method in linear systems (cf. Theorem 1). Second, Re Score jointly mitigates the negative effect of spurious edge learning and performance drop in heterogeneous data via auto-learnable adaptive weights. Theoretical analysis in Section 3.3 (cf. Theorem 2) validates the oracle adaptive properties of weights. Third, Re Score boosts the causal discovery performance by a large margin. Surprisingly, it performs competitively or even outperforms CD-NOD (Huang et al., 2020) and DICD (Wang et al., 2022), which require domain annotation, on heterogeneous synthetic data and real-world data (cf. Section 4.2). 2 DIFFERENTIABLE CAUSAL DISCOVERY We begin by introducing the task formulation of causal discovery and the identifiability issue. We then present the differentiable score-based scheme to optimize the DAG learner. Task Formulation. Causal discovery aims to infer the Structural Causal Model (SCM) (Pearl, 2000; Pearl et al., 2016) from the observational data, which best describes the data generating procedure. Formally, let X Rn d be a matrix of observational data, which consists of n independent and identically distributed (i.i.d.) random vectors X = (X1, . . . , Xd) Rd. Given X, we aim to learn a SCM (PX, G), which encodes a causal directed acyclic graph (DAG) with a structural equation model (SEM) to reveal the data generation from the distribution of variables X. Specifically, we denote the DAG by G = (V (G), E(G)), where V (G) is the variable set and E(G) collects the causal directed edges between variables. We present the joint distribution over X as PX, which is Markov w.r.t. G. The probability distribution function of PX is factored as p(x) = Qd i=1 P(xi|xpa(i)), where pa(i) = {j V (G) : Xj Xi E(G)} is the set of parents of variable Xi in G and P(xi|xpa(i)) is the conditional probability density function of variable Xi given Xpa(i). 
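To make the factorization p(x) = ∏_i P(x_i | x_pa(i)) concrete, the sketch below samples observational data from a linear-Gaussian SEM by filling in each variable after its parents. It is a minimal Python/NumPy illustration (assuming the variables are already listed in a topological order), not the paper's data-generation code.

```python
import numpy as np

def sample_linear_sem(B, n, noise_scale=1.0, seed=0):
    """Draw n i.i.d. samples X in R^{n x d} from a linear SEM X_i = sum_j B[j, i] X_j + N_i.

    B is the weighted adjacency matrix (B[j, i] != 0 iff X_j -> X_i). For brevity we
    assume B is strictly upper triangular, i.e., the variables are already in a
    topological order, so filling columns left to right realizes
    p(x) = prod_i P(x_i | x_pa(i)).
    """
    rng = np.random.default_rng(seed)
    d = B.shape[0]
    X = np.zeros((n, d))
    for i in range(d):                       # parents of X_i live in columns j < i
        X[:, i] = X @ B[:, i] + noise_scale * rng.standard_normal(n)
    return X

# Chain X0 -> X1 -> X2 with unit edge weights.
B = np.array([[0., 1., 0.],
              [0., 0., 1.],
              [0., 0., 0.]])
X = sample_linear_sem(B, n=2000)
```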
As a result, the SEM can be formulated as a collection of d structural equations: Xi = fi(Xpa(i), Ni), i = 1, , d (1) where fi : R|Xpa(i)| R can be any linear or nonlinear function, and N = (N1, . . . , Nd) are jointly independent noise variables. Identifiability Issue. In general, without further assumption on the SEM (cf. Equation 1), it is not possible to uniquely learn the DAG G by only using the observations of PX. This is the identifiability issue in causal discovery (Lachapelle et al., 2020). Nonetheless, with the assumption of the SEM, the DAG G is said to be identifiable over PX, if no other SEM can encode the same distribution PX with a different DAG under the same assumption. To guarantee the identifiability, most prior studies restrict the form of the structural equations to be additive w.r.t. to noises, i.e., additive noise models (ANM). Assuming ANM, as long as the structural equations are linear with non-Gaussian errors (Shimizu et al., 2006; Loh & B uhlmann, 2014), linear Gaussian model with equal noise variances (Peters & B uhlmann, 2014), or nonlinear structural equation model with mild conditions (Hoyer et al., 2008; Zhang & Hyvarinen, 2009; Peters et al., 2014), then the DAG G is identifiable. Solution to Causal Discovery. Prevailing causal discovery approaches roughly fall into two lines: constraintand score-based methods (Spirtes & Zhang, 2016; Glymour et al., 2019). Specifically, constraint-based methods (Spirtes et al., 1995; Spirtes & Glymour, 1991; Colombo et al., 2012) determine up to the Markov equivalence class of causal graphs, based on conditional independent tests under certain assumptions. Score-based methods (Vowels et al., 2021) evaluate the candidate graphs with a predefined score function and search the DAG space for the optimal graph. Here we focus on the score-based line. Score-based Causal Discovery. With a slight abuse of notation, G refers to a directed graph in the rest of the paper. Formally, the score-based scheme casts the task of DAG learning as a combinatorial optimization problem: min G S(G; X) = L(G; X) + 位Rsparse(G) s.t. G DAG, (2) Here this problem consists of two ingredients: the combinatorial acyclicity constraint G DAG and the score function S(G; X). The score function composes two terms: (1) the goodness-of-fit Published as a conference paper at ICLR 2023 measure L(G; X) = 1 n Pn i=1 l(xi, f(xi)), where l(xi, f(xi)) represents the loss of fitting observation xi; (2) the sparsity regularization Rsparse(G) stipulating that the total number of edges in G should be penalized; And 位 is a hyperparameter controlling the regularization strengths. Next, we will elaborate on the previous implementations of these two major ingredients. To implement S(G; X), various approaches have been proposed, such as penalized least-squares loss (Zheng et al., 2020; 2018; Ng et al., 2019b), Evidence Lower Bound (ELBO) (Yu et al., 2019), loglikelihood with complexity regularizers (Kalainathan et al., 2018; Van de Geer & B uhlmann, 2013; Ng et al., 2020), Maximum Mean Discrepancy (MMD) (Goudet et al., 2018), Bayesian Information Criterion (BIC) (Geiger & Heckerman, 1994; Zhu et al., 2020), Bayesian Dirichlet equivalence uniform (BDeu) score (Heckerman et al., 1995), Bayesian Gaussian equivalent (BGe) score (Kuipers et al., 2014), and others (Huang et al., 2018; Bach & Jordan, 2002; Sokolova et al., 2014). As G DAG enforces G to be acyclic, it becomes the main obstacle to the score-based scheme. 
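For a linear SEM parameterized by a weighted adjacency matrix W (so that f(X) = XW), the most common instantiation of Equation 2 is the penalized least-squares score. The sketch below is an illustrative Python/NumPy stand-in, using an L1 penalty as a continuous surrogate for R_sparse; it is not tied to any particular implementation.

```python
import numpy as np

def least_squares_loss(W, X):
    """Goodness-of-fit L(G; X): average of the per-sample halved squared residuals
    l(x_i, f(x_i)) with f(x) = x W, i.e., (1 / 2n) * ||X - X W||_F^2."""
    n = X.shape[0]
    residual = X - X @ W
    return (residual ** 2).sum() / (2.0 * n)

def score(W, X, lam=0.1):
    """S(G; X) = L(G; X) + lam * Rsparse(G), with an L1 norm on W as the
    continuous surrogate for penalizing the number of edges."""
    return least_squares_loss(W, X) + lam * np.abs(W).sum()
```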
Prior studies propose various approaches to search the acyclic space, such as greedy search (Chickering, 2002; Hauser & Bühlmann, 2012), hill-climbing (Gámez et al., 2011; Tsamardinos et al., 2006), dynamic programming (Silander & Myllymäki, 2006; Koivisto & Sood, 2004), A* (Yuan & Malone, 2013), and integer linear programming (Jaakkola et al., 2010; Cussens, 2011).

Differentiable Score-based Optimization. Different from the aforementioned search approaches, NOTEARS (Zheng et al., 2018) reframes the combinatorial optimization problem as a continuous constrained optimization problem:

min_G S(G; X) s.t. H(G) = 0, (3)

where H(G) = 0 is a differentiable equality DAG constraint. To build this constraint, the prior effort (Zheng et al., 2018) depicts the DAGness of G's adjacency matrix A(G) ∈ {0, 1}^{d×d}, where [A(G)]_ij = 1 if the causal edge X_j → X_i exists in E(G) and [A(G)]_ij = 0 otherwise. Prevailing implementations of the DAGness constraint are H(G) = Tr(e^{A∘A}) − d (Zheng et al., 2018), H(G) = Tr[(I + αA∘A)^d] − d (Yu et al., 2019), and others (Wei et al., 2020; Kyono et al., 2020; Bello et al., 2022; Zhu et al., 2021), where A is shorthand for A(G) and ∘ denotes the element-wise product. The optimization problem in Equation 3 can then be formulated via the augmented Lagrangian method as:

min_G S(G; X) + P_DAG(G), (4)

where P_DAG(G) = αH(G) + (ρ/2)|H(G)|^2 is the penalty term enforcing DAGness on G, ρ > 0 is a penalty parameter, and α is the Lagrange multiplier.

3 METHODOLOGY OF RESCORE

On the basis of differentiable score-based causal discovery methods, we first devise our ReScore and then present its desirable properties.

3.1 BILEVEL FORMULATION OF RESCORE

Aiming to learn the causal structure accurately in practical scenarios, we focus on observational data that is heterogeneous and contains a large proportion of easy samples. Standard differentiable score-based causal discovery methods apply the average score function to all samples equally, which inherently relies on easy samples to obtain a high average goodness-of-fit. As a result, the DAG learner is error-prone to constructing easier-to-fit spurious edges based on the easy samples, while ignoring the causal relationship information contained in hard samples. Assuming the oracle importance of each sample is known, we can assign distinct weights to different samples and formulate the reweighted score function S_w(G; X) instead of the average score function:

S_w(G; X) = L_w(G; X) + λR_sparse(G) = Σ_{i=1}^n w_i l(x_i, f(x_i)) + λR_sparse(G), (5)

where w = (w_1, . . . , w_n) is a sample reweighting vector of length n, wherein w_i indicates the importance of the i-th observed sample x_i.

However, the oracle sample importance is usually unavailable in real-world scenarios. The problem, hence, comes down to how to automatically learn an appropriate sample reweighting vector w. Intuitively, samples easily fitted with spurious edges should contribute less to the DAG learning, while samples that do not support spurious edges but contain critical information about the causal edges should be more important. We therefore use a simple heuristic: downweight the easier-to-fit but less informative samples, and upweight the less-fitted but more informative samples. This inspires us to learn to allocate weights adaptively, with the aim of maximizing the influence of less well-fitted samples and failing the DAG learner.
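The DAG penalty of Equation 4 and the reweighted score of Equation 5 translate almost directly into code. Below is a minimal Python sketch (using NumPy and SciPy's matrix exponential); the hyperparameter values are placeholders, and the per-sample loss is the halved squared residual of a linear model, both assumptions made purely for illustration.

```python
import numpy as np
from scipy.linalg import expm

def h(W):
    """NOTEARS-style acyclicity measure H(G) = Tr(exp(W * W)) - d; zero iff W is a DAG."""
    return np.trace(expm(W * W)) - W.shape[0]

def dag_penalty(W, alpha=0.0, rho=1.0):
    """P_DAG(G) = alpha * H(G) + (rho / 2) * H(G)^2 (augmented Lagrangian penalty)."""
    hw = h(W)
    return alpha * hw + 0.5 * rho * hw ** 2

def reweighted_score(W, X, w, lam=0.1):
    """S_w(G; X) = sum_i w_i * l(x_i, f(x_i)) + lam * Rsparse(G)  (Equation 5),
    with l the halved squared residual of the linear fit f(x) = x W."""
    per_sample = 0.5 * ((X - X @ W) ** 2).sum(axis=1)   # l(x_i, f(x_i)) for each sample
    return per_sample @ w + lam * np.abs(W).sum()
```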
Formally, we cast the overall framework of reweighting samples to boost causal discovery as the following bilevel optimization problem:

min_G S_{w*}(G; X) + P_DAG(G), s.t. w* ∈ argmax_{w ∈ C(τ)} S_w(G; X), (6)

where C(τ) := {w : 0 < τ/n ≤ w_1, . . . , w_n ≤ 1/(τn), Σ_{i=1}^n w_i = 1} for a cutoff threshold τ ∈ (0, 1). The deviation of the weight distribution from the uniform distribution is bounded by the hyperparameter τ. Clearly, Equation 6 consists of two objectives, where the inner-level objective (i.e., optimize w by maximizing the reweighted score function) is nested within the outer-level objective (i.e., optimize G by minimizing the differentiable score-based loss). Solving the outer-level problem is subject to the optimal value of the inner-level problem.

Now we introduce how to solve this bilevel optimization problem. In the inner loop, we first fix the DAG learner, which evaluates the error of each observed sample x_i, i ∈ {1, . . . , n}, and then maximize the reweighted score function to learn the corresponding weight w*_i. In the outer loop, upon the reweighted observations whose weights are determined in the inner loop, we minimize the reweighted score function to optimize the DAG learner. By alternately training the inner and outer loops, the importance of each sample is adaptively estimated based on the DAG learner's error, which in turn gradually guides the DAG learner to perform better on the informative samples. It is worth highlighting that this ReScore scheme can be applied to any differentiable score-based causal discovery method listed in Section 2. The procedure for training ReScore is outlined in Algorithm 1 and illustrated by the simplified sketch below. Furthermore, our ReScore has the following desirable advantages. First, as shown in Section 3.2, under mild conditions, ReScore inherits the identifiability property of the original differentiable score-based causal discovery method. Second, ReScore generates adaptive weights for the observations through the bilevel optimization, so as to distinguish the more informative samples and fulfill their potential to guide the DAG learning; this is consistent with our theoretical analysis in Section 3.3 and empirical results in Section 4.2. Third, ReScore is widely applicable to various types of data and models; in other words, it is model-agnostic and can effectively handle heterogeneous data without knowing the domain annotations in advance. Detailed ReScore performance can be found in Section 4.

3.2 THEORETICAL ANALYSIS ON IDENTIFIABILITY

The graph identifiability issue is the primary challenge hindering the development of structure learning. As an optimization framework, the most desired property of ReScore is the capacity to ensure graph identifiability and substantially boost the performance of the differentiable score-based DAG learner. We develop Theorem 1, which guarantees DAG identifiability when using ReScore. Rendering a DAG theoretically identifiable requires three standard steps (Peters et al., 2014; Zheng et al., 2020; Ng et al., 2022): (1) assuming a particular restricted family of functions and data distributions for the SEM in Equation 1; (2) theoretically proving the identifiability of the SEM; and (3) developing an optimization algorithm with a predefined score function and showing that the learned DAG asymptotically converges to the ground-truth DAG. Clearly, ReScore naturally inherits the original identifiability of a specific SEM as stated in Section 2.
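Before turning to that third step for ReScore, the following sketch makes the alternating scheme of Equation 6 concrete. It is a simplified Python stand-in for Algorithm 1 in Appendix B: because S_w is linear in w, the inner maximization over C(τ) is solved here in closed form by pushing weight toward the largest per-sample losses, whereas the paper trains a parametric reweighting model; the outer step is a plain gradient update and omits the warm-up and augmented-Lagrangian schedule. The callables per_sample_loss and grad_W are assumed to be supplied by the backbone DAG learner.

```python
import numpy as np

def inner_max_weights(losses, tau):
    """Exactly solve max_w sum_i w_i * losses_i over
    C(tau) = {w : tau/n <= w_i <= 1/(tau*n), sum_i w_i = 1}:
    the largest losses receive the upper-bound weight, the rest the lower bound,
    with one intermediate weight so the total is exactly one."""
    n = len(losses)
    lo, hi = tau / n, 1.0 / (tau * n)
    w = np.full(n, lo)
    budget = 1.0 - w.sum()                     # mass left after the lower bounds
    for i in np.argsort(losses)[::-1]:         # hardest samples first
        extra = min(hi - lo, budget)
        w[i] += extra
        budget -= extra
        if budget <= 0:
            break
    return w

def rescore_loop(X, per_sample_loss, grad_W, W0, tau=0.5, lr=1e-2, outer=100):
    """Alternate the outer descent on W with the inner weight update (Equation 6).
    per_sample_loss(W, X) returns l(x_i, f(x_i)) for every sample; grad_W(W, X, w)
    returns the gradient of the reweighted objective; both are assumed given."""
    W, n = W0.copy(), X.shape[0]
    w = np.full(n, 1.0 / n)                    # start from uniform weights
    for _ in range(outer):
        W -= lr * grad_W(W, X, w)              # outer: minimize S_w + P_DAG over W
        w = inner_max_weights(per_sample_loss(W, X), tau)  # inner: re-estimate weights
    return W, w
```

Note that the closed-form inner solution also makes the adaptiveness of the weights visible directly: samples with larger losses never receive smaller weights than samples with smaller losses, in line with Theorem 2 below.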
Consequently, the key concern lies in the third step: whether the DAG learned by our new optimization framework with the reweighted score function S_w(G; X) can asymptotically converge to the ground-truth DAG. To address this, we present the following theorem. Specifically, it demonstrates that, by guaranteeing the equivalence of the optimization problems (Equation 2 and Equation 6) in linear systems, the bounded weights do not affect the consistency results in the identifiability analysis. See the detailed proof in Appendix C.1.

Theorem 1. Suppose the SEM in Equation 1 is linear and the size of the observational data X is n. As the data size increases, i.e., n → ∞,

argmin_G {S_w(G; X) + P_DAG(G)} − argmin_G {S(G; X) + P_DAG(G)} →a.s. 0

in the following cases: a. using the least-squares loss L(G; X) = (1/2n)‖X − f(X)‖²_F; b. using the negative log-likelihood loss with standard Gaussian noise.

Remark: The identifiability property of ReScore with the two most common score functions, namely the least-squares loss and the negative log-likelihood loss, is proved in Theorem 1. Similar conclusions can be derived for other loss functions, which we will explore in future work.

3.3 ORACLE PROPERTY OF ADAPTIVE WEIGHTS

Our ReScore suggests assigning varying degrees of importance to different observational samples. At its core is the simple yet effective heuristic that less-fitted samples are more important than easier-to-fit samples, as they do not support spurious edges but contain critical information about the causal edges. Hence, mining hard-to-learn causation information is promising for helping DAG learners mitigate the negative influence of spurious edges. The following theorem shows the adaptiveness property of ReScore, i.e., instead of treating all samples equally, ReScore tends to upweight the importance of hard but informative samples while downweighting the reliance on easier-to-fit samples.

Theorem 2. Suppose that in the optimization phase the i-th observation has a larger error than the j-th observation, in the sense that l(x_i, f(x_i)) > l(x_j, f(x_j)) for i, j ∈ {1, . . . , n}. Then w*_i ≥ w*_j, where w*_i, w*_j are the optimal weights in Equation 6. The equality holds if and only if w*_i = w*_j = τ/n or w*_i = w*_j = 1/(τn).

See Appendix C.2 for the detailed proof. It is simple to infer that, following the inner loop that maximizes the reweighted score function S_w(G; X), the observations are ranked by the learned adaptive weights w*. That is, an observation equipped with a higher weight has a greater impact on the subsequent outer loop and thus contributes more to the DAG learning.

4 EXPERIMENTS

We aim to answer the following research questions. RQ1: As a model-agnostic framework, can ReScore broadly strengthen differentiable score-based causal discovery baselines? RQ2: How does ReScore perform when the noise distribution varies? Can ReScore effectively learn adaptive weights that successfully identify the important samples?

Baselines. To answer the first question (RQ1), we implement various backbone models, including NOTEARS (Zheng et al., 2018) and GOLEM (Ng et al., 2020) in linear systems, and NOTEARS-MLP (Zheng et al., 2020) and GraN-DAG (Lachapelle et al., 2020) in nonlinear settings. To answer the second question (RQ2), we compare GOLEM+ReScore and NOTEARS-MLP+ReScore to a SOTA baseline, CD-NOD (Huang et al., 2020), and a recently proposed approach, DICD (Wang et al., 2022), both of which require ground-truth domain annotations.
For a comprehensive comparison, extensive experiments are conducted on both homogeneous and heterogeneous synthetic datasets as well as a real-world benchmark dataset, i.e., Sachs (Sachs et al., 2005). In Sachs, GES (Chickering, 2002), a benchmark discrete score-based causal discovery method, is also considered. A detailed description of the employed baselines can be found in Appendix D.1. Evaluation Metrics. To evaluate the quality of structure learning, four metrics are reported: True Positive Rate (TPR), False Discovery Rate (FDR), Structural Hamming Distance (SHD), and Structural Intervention Distance (SID) (Peters & B uhlmann, 2015), averaged over ten random trails. Published as a conference paper at ICLR 2023 Table 1: Results for ER graphs of 10 nodes on linear and nonlinear synthetic datasets. ER2 ER4 TPR FDR SHD SID TPR FDR SHD SID Random 0.08 0.07 0.93 0.18 33.2 7.3 95.6 12.2 0.09 0.17 0.93 0.09 52.3 16.7 80.3 17.7 NOTEARS 0.85 0.09 0.07 0.07 5.8 2.2 20.8 5.2 0.79 0.11 0.09 0.05 10.0 5.2 25.8 9.9 + Re Score 0.89 0.07+5% 0.08 0.09 12% 4.6 2.3+26% 12.8 7.0+63% 0.85 0.04+8% 0.05 0.04+57% 7.2 1.9+39% 24.2 8.4+7% GOLEM 0.87 0.06 0.22 0.11 6.5 3.4 13.0 6.7 0.63 0.03 0.16 0.03 17.2 1.3 48.0 13.3 + Re Score 0.88 0.06+1% 0.21 0.11+2% 6.0 3.4+8% 12.4 6.3+5% 0.66 0.04+5% 0.17 0.01 5% 16.2 1.0+6% 46.7 13.3+3% NOTEARS-MLP 0.76 0.17 0.14 0.09 7.0 3.5 17.9 10.0 0.83 0.05 0.21 0.04 10.9 1.9 28.6 12.0 + Re Score 0.73 0.07 4% 0.10 0.09+37% 6.8 2.9+3% 20.3 9.7 11% 0.94 0.06+14% 0.15 0.06+44% 6.80 2.7+60% 8.80 12.4+225% Gra N-DAG 0.88 0.06 0.02 0.03 2.7 1.6 8.70 4.8 0.98 0.02 0.12 0.03 5.4 1.1 3.70 4.71 + Re Score 0.90 0.05+2% 0.01 0.03+35% 2.4 1.1+13% 7.20 3.0+21% 0.99 0.01+1% 0.11 0.01+12% 4.80 0.6+13% 0.50 0.81+640% (a) TPR w.r.t. 位 (b) FDR w.r.t. 位 (c) SHD w.r.t. 位 (d) SID w.r.t. 位 Figure 2: Performance comparison between NOTEARS-MLP and Re Score on ER4 graphs of 10 nodes on nonlinear synthetic datasets. The hyperparameter 位 defined in Equation 2 refers to the graph sparsity. See more results in Appendix D.4 4.1 OVERALL PERFORMANCE COMPARISON ON SYNTHETIC DATA (RQ1) Simulations. The generating data differs along three dimensions: number of nodes, the degree of edge sparsity, and the type of graph. Two well-known graph sampling models, namely Erdos-Renyi (ER) and scale-free (SF) (Barab asi & Albert, 1999) are considered with kd expected edges (denoted as ERk or SFk) and d = {10, 20, 50} nodes. Specifically, in linear settings, similar to (Zheng et al., 2018; Gao et al., 2021), the coefficients are assigned following U( 2, 0.5) U(0.5, 2) with additive standard Gaussian noise. In nonlinear settings, following (Zheng et al., 2020), the groundtruth SEM in Equation 1 is generated under the Gaussian process (GP) with radial basis function kernel of bandwidth one, where fi( ) is additive noise models with Ni as an i.i.d. random variable following standard normal distribution. Both of these settings are known to be fully identifiable (Peters & B uhlmann, 2014; Peters et al., 2014). For each graph, 10 data sets of 2,000 samples are generated and the mean and standard deviations of the metrics are reported for a fair comparison. Results. Tables 1, 9 and Tables in Appendix D.4 report the empirical results on both linear and nonlinear synthetic data. The error bars depict the standard deviation across datasets over ten trails. The red and blue percentages separately refer to the increase and decrease of Re Score relative to the original score-based methods in each metric. The best performing methods are bold. 
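For reference, the first three metrics above compare an estimated binary adjacency matrix with the ground truth and can be computed directly; the sketch below is a minimal Python/NumPy illustration (SID is omitted because it requires reasoning about interventional distributions, and SHD conventions differ slightly across papers).

```python
import numpy as np

def structure_metrics(A_est, A_true):
    """TPR, FDR, and SHD for binary adjacency matrices A[i, j] = 1 iff edge X_i -> X_j.

    SHD here counts missing edges, extra edges, and reversed edges, with a reversed
    edge counted once; this is a common convention but not the only one in use."""
    A_est, A_true = A_est.astype(bool), A_true.astype(bool)
    tp = np.sum(A_est & A_true)                       # correctly oriented edges
    fp = np.sum(A_est & ~A_true)                      # predicted edges absent from the truth
    fn = np.sum(~A_est & A_true)                      # true edges that were missed
    reversed_ = np.sum(A_est & ~A_true & A_true.T)    # predicted edges reversed in the truth
    tpr = tp / max(A_true.sum(), 1)
    fdr = fp / max(A_est.sum(), 1)
    shd = (fp - reversed_) + (fn - reversed_) + reversed_
    return tpr, fdr, shd
```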
We find that: Re Score consistently and significantly strengthens the score-based methods for structure learning across all datasets. In particular, it achieves substantial gains over the state-of-the-art baselines by around 3% to 60% in terms of SHD, revealing a lower number of missing, falsely detected, and reversed edges. We attribute the improvements to the dynamically learnable adaptive weights, which boost the quality of score-based DAG learners. With a closer look at the TPR and FDR, Re Score typically lowers FDR by eliminating spurious edges and enhances TPR by actively identifying more correct edges. This clearly demonstrates that Re Score effectively filters and upweights the more informative samples to better extract the causal relationship. Figure 2 also illustrates the clear trend that Re Score is excelling over NOTEARS-MLP as the sparsity penalty climbs. Additionally, as Table 7 indicates, Re Score only adds a negligible amount of computational complexity as compared to the backbone score-based DAG learners. Score-based causal discovery baselines suffer from a severe performance drop on highdimensional dense graph data. Despite the advances, beyond linear, NOTEARS-MLP and Gra N-DAG fail to scale to more than 50 nodes in SF4 and ER4 graphs, mainly due to difficulties in enforcing acyclicity in high-dimensional dense graph data (Varando, 2020; Lippe et al., 2022). Specifically, the TPR of Gra N-DAG and NOTEARS-MLP in SF4 of 50 nodes is lower than 0.2, which indicates that they are not even able to accurately detect 40 edges out of 200 ground-truth edges. Re Score, as an optimization framework, relies heavily on the performance of the score- Published as a conference paper at ICLR 2023 (a) Weights on linear data at the 1st and last epochs (b) Weights on nonlinear data at the 1st and last epochs Figure 3: Illustration of adaptive weights learned by Re Score w.r.t. sample loss on both linear and nonlinear synthetic data. For each dataset, the left and right plots refer to the distribution of adaptive weights at the first and last epochs in the outer loop, respectively (i.e., the value of w , when k1 = 0 and k1 = Kouter in Algorithm 1, respectively). The disadvantaged but more informative samples are represented by the red dots. The dominant and easy samples, in contrast, are in blue. based backbone model. When the backbone model fails to infer DAG on its own as the number of nodes and edge density increase, adding Re Score will not be able to enhance the performance. 4.2 PERFORMANCE ON HETEROGENEOUS DATA (RQ2) 4.2.1 EVALUATION ON SYNTHETIC HETEROGENEOUS DATA Motivations. It is commonplace to encounter heterogeneous data in real-world applications, of which the underlying causal generating process remain stable but the noise distribution may vary. Specific DAG learners designed for heterogeneous data are prone to assume strict conditions and require the knowledge of group annotation for each sample. Group annotations, however, are extremely costly and challenging to obtain. We conjecture that a robust DAG learner is able to successfully handle heterogeneous data without the information of group annotation. Simulations. Synthetic heterogeneous data in both linear and nonlinear settings (n = 1000, d = 20, ER2) containing two distinct groups are also considered. 10% of observations come from the disadvantaged group, where half of the noise variables Ni defined in Equation 1 follow N(0, 1) and the remaining half of noise variables follow N(0, 0.1). 
90% of the observations, in contrast, are generated from the dominant group where the scales of noise variables are flipped. Results. To evaluate whether Re Score can handle heterogeneous data without requiring the group annotation by automatically identifying and upweighting informative samples, we compare baseline+Re Score to CD-NOD and DICD, two SOTA causal discovery approaches that rely on group annotations and are developed for heterogeneous data. Additionally, a non-adaptive reweighting method called baseline+IPS is taken into account, in which sample weights are inversely proportional to group sizes. Specifically, we divide the whole observations into two subgroups. Obviously, a single sample from the disadvantaged group is undoubtedly more informative than a sample from the dominant group, as it offers additional insight to depict the causal edges. Table 2: Results on heterogeneous data. Linear TPR FDR SHD Nonlinear TPR FDR SHD GOLEM 0.79 0.33 18.7 NOTEARS-MLP 0.62 0.36 25.8 + IPS 0.65 0.19 18.6 + IPS 0.35 0.21 28.7 + Re Score 0.81 0.24 16.4 + Re Score 0.63 0.32 23.8 CD-NOD 0.51 0.17 24.1 CN-NOD 0.60 0.29 26.0 DICD 0.82 0.28 16.7 DICD 0.50 0.24 23.5 As Figure 3 shows, dots of different colours are mixed and scattered at the beginning of the training. After multiple iterations of training in inner and outer loops, the red dots from the disadvantaged group are gradually identified and assigned to relatively larger weights as compared to those blue dots with the same measureof-fitness. This illustrates the effectiveness of Re Score and further offers insight into the reason for its performance improvements when handling heterogeneous data. Overall, all figures show clear positive trends, i.e., the underrepresented samples tend to learn bigger weights. These results validate the property of adaptive weights in Theorem 2. Table 2 indicates that Re Score drives impressive performance breakthroughs in heterogeneous data, achieving competitive or even lower SHD without group annotations compared to CD-NOD and DICD recognized as the lower bound. Specifically, both GOLEM and NOTEARS-MLP are struggling from notorious performance drop when homogeneity assumption is invalidated, and posing hurdle from being scaled up to real-world large-scale applications. We ascribe this hurdle to blindly scoring the observational samples evenly, rather than distilling the crucial group information from Published as a conference paper at ICLR 2023 distribution shift of noise variables. To better highlight the significance of the adaptive property, we also take Baseline+IPS into account, which views the ratio of group size as the propensity score and exploits its inverse to re-weight each sample s loss. Baseline+IPS suffers from severe performance drops in terms of TPR, revealing the limitation of fixed weights. In stark contrast, benefiting from adaptive weights, Re Score can even extract group information from heterogeneous data that accomplish more profound causation understanding, leading to higher DAG learning quality. This validates that Re Score endows the backbone score-based DAG learner with better robustness against the heterogeneous data and alleviates the negative influence of spurious edges. 4.2.2 EVALUATIONS ON REAL HETEROGENEOUS DATA. Table 3: The performance comparison on Sachs dataset. 
TPR FDR SHD SID #Predicted Edges Random 0.076 0.899 23 63 22 GOLEM 0.176 0.026 15 53 4 + Re Score 0.294 0.063 14 49 6 NOTEARS-MLP 0.412 0.632 16 45 19 + Re Score 0.412 0.500 13 43 14 Gra N-DAG 0.294 0.643 16 60 14 + Re Score 0.353 0.600 15 58 15 GES 0.294 0.853 31 54 34 + Re Score 0.588 0.722 28 50 36 CD-NOD 0.588 0.444 15 - 18 Sachs (Sachs et al., 2005) contains the measurement of multiple phosphorylated protein and phospholipid components simultaneously in a large number of individual primary human immune system cells. In Sachs, nine different perturbation conditions are applied to sets of individual cells, each of which administers certain reagents to the cells. With the annotations of perturbation conditions, we consider the Sachs as real-world heterogeneous data (Mooij et al., 2020). We train baselines on 7,466 samples, where the ground-truth graph (11 nodes and 17 edges) is widely accepted by the biological community. As Table 3 illustrates, Re Score steadily and prominently boosts all baselines, including both differentiable and discrete score-based causal discovery approaches w.r.t. SHD and SID metrics. This clearly shows the effectiveness of Re Score to better mitigate the reliance on easier-to-fit samples. With a closer look at the TPR and FDR, baseline+Re Score surpasses the state-of-the-art corresponding baseline by a large margin in most cases, indicating that Re Score can help successfully predict more correct edges and fewer false edges. Remarkably, compared to CD-NOD, which is designed for heterogeneous data and utilizes the annotations as prior knowledge, GES+Re Score obtains competitive TPR without using ground-truth annotations. Moreover, Gra N-DAG+Re Score can reach the same SHD as CD-NOD when 15 and 18 edges are predicted, respectively. These findings validate the potential of Re Score as a promising research direction for enhancing the generalization and accuracy of DAG learning methods when dealing with real-world data. 5 CONCLUSION Today s differentiable score-based causal discovery approaches are still far from being able to accurately detect the causal structures, despite their great success on synthetic linear data. In this paper, we proposed Re Score, a simple-yet-effective model-agnostic optimization framework that simultaneously eliminates spurious edge learning and generalizes to heterogeneous data by utilizing learnable adaptive weights. Grounded by theoretical proof and empirical visualization studies, Re Score successfully identifies the informative samples and yields a consistent and significant boost in DAG learning. Extensive experiments verify that the remarkable improvement of Re Score on a variety of synthetic and real-world datasets indeed comes from adaptive weights. Two aspects of Re Score s limitations will be covered in subsequent works. First, the performance of Re Score is highly related to the causal discovery backbone models, which leads to minor improvements when the backbone methods fail. Second, having empirically explored the sensitivity to pure noise samples in D.3.2, we will theoretically analyze and further enhance the robustness of Re Score against these noises. It is expected to substantially improve the DAG learning quality, as well as distinguish true informative samples from pure noise samples. We believe that Re Score provides a promising research direction to diagnose the performance degradation for nonlinear and heterogeneous data in the structure learning challenge and will inspire more works in the future. 
Published as a conference paper at ICLR 2023 ACKNOWLEDGMENTS This research is supported by the Sea-NEx T Joint Lab, the National Natural Science Foundation of China (9227010114), and CCCD Key Lab of the Ministry of Culture and Tourism. Martin Arjovsky, L eon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. ar Xiv preprint ar Xiv:1907.02893, 2019. Francis R. Bach and Michael I. Jordan. Learning graphical models with mercer kernels. In NIPS, 2002. Albert-L aszl o Barab asi and R eka Albert. Emergence of scaling in random networks. science, 286 (5439):509 512, 1999. Kevin Bello, Bryon Aragam, and Pradeep Ravikumar. Dagma: Learning dags via m-matrices and a log-determinant acyclicity characterization. ar Xiv preprint ar Xiv:2209.08037, 2022. Ruichu Cai, Jincheng Ye, Jie Qiao, Huiyuan Fu, and Zhifeng Hao. Fom: Fourth-order moment based causal direction identification on the heteroscedastic data. Neural Networks, 124:193 201, 2020. David Maxwell Chickering. Optimal structure identification with greedy search. Journal of machine learning research, 3(Nov):507 554, 2002. David Maxwell Chickering, David Heckerman, and Christopher Meek. Large-sample learning of bayesian networks is np-hard. Journal of machine learning research, 5:1287 1330, 2004. Diego Colombo, Marloes H Maathuis, Markus Kalisch, and Thomas S Richardson. Learning highdimensional directed acyclic graphs with latent and selection variables. The Annals of Statistics, pp. 294 321, 2012. Elliot Creager, J orn-Henrik Jacobsen, and Richard Zemel. Environment inference for invariant learning. In ICML, 2021. James Cussens. Bayesian network learning with cutting planes. In UAI, 2011. Jos e A G amez, Juan L Mateo, and Jos e M Puerta. Learning bayesian networks by hill climbing: efficient methods based on progressive restriction of the neighborhood. Data Mining and Knowledge Discovery, 22(1):106 148, 2011. Yinghua Gao, Li Shen, and Shu-Tao Xia. DAG-GAN: causal structure learning with generative adversarial nets. In ICASSP, 2021. Dan Geiger and David Heckerman. Learning gaussian networks. In UAI, 1994. Amir Emad Ghassami, Saber Salehkaleybar, Negar Kiyavash, and Kun Zhang. Learning causal structures using regression invariance. In NIPS, 2017. Amir Emad Ghassami, Negar Kiyavash, Biwei Huang, and Kun Zhang. Multi-domain causal structure learning in linear systems. In Neur IPS, 2018. Clark Glymour, Kun Zhang, and Peter Spirtes. Review of causal discovery methods based on graphical models. Frontiers in genetics, 10:524, 2019. Olivier Goudet, Diviyan Kalainathan, Philippe Caillou, Isabelle Guyon, David Lopez-Paz, and Michele Sebag. Learning functional causal models with generative neural networks. In Explainable and interpretable models in computer vision and machine learning, pp. 39 80. 2018. Alain Hauser and Peter B uhlmann. Characterization and greedy learning of interventional markov equivalence classes of directed acyclic graphs. Journal of machine learning research, 13:2409 2464, 2012. Published as a conference paper at ICLR 2023 Yue He, Peng Cui, Zheyan Shen, Renzhe Xu, Furui Liu, and Yong Jiang. DARING: differentiable causal discovery with residual independence. In KDD, 2021. David Heckerman, Dan Geiger, and David M Chickering. Learning bayesian networks: The combination of knowledge and statistical data. Machine learning, 20(3):197 243, 1995. Patrik O. Hoyer, Dominik Janzing, Joris M. Mooij, Jonas Peters, and Bernhard Sch olkopf. Nonlinear causal discovery with additive noise models. In NIPS, 2008. 
Biwei Huang, Kun Zhang, Yizhu Lin, Bernhard Sch olkopf, and Clark Glymour. Generalized score functions for causal discovery. In KDD, 2018. Biwei Huang, Kun Zhang, Mingming Gong, and Clark Glymour. Causal discovery and forecasting in nonstationary environments with state-space models. In ICML, 2019. Biwei Huang, Kun Zhang, Jiji Zhang, Joseph D. Ramsey, Ruben Sanchez-Romero, Clark Glymour, and Bernhard Sch olkopf. Causal discovery from heterogeneous/nonstationary data. Journal of machine learning research, 21:89:1 89:53, 2020. Tommi S. Jaakkola, David A. Sontag, Amir Globerson, and Marina Meila. Learning bayesian network structure using LP relaxations. In AISTATS, 2010. Diviyan Kalainathan, Olivier Goudet, Isabelle Guyon, David Lopez-Paz, and Mich ele Sebag. Structural agnostic modeling: Adversarial learning of causal graphs. ar Xiv preprint ar Xiv:1803.04929, 2018. Ilyes Khemakhem, Ricardo Pio Monti, Robert Leech, and Aapo Hyv arinen. Causal autoregressive flows. In AISTATS, 2021. Mikko Koivisto and Kismat Sood. Exact bayesian structure discovery in bayesian networks. Journal of machine learning research, 5:549 573, 2004. Jack Kuipers, Giusi Moffa, and David Heckerman. Addendum on the scoring of gaussian directed acyclic graphical models. The Annals of Statistics, 42(4):1689 1691, 2014. Trent Kyono, Yao Zhang, and Mihaela van der Schaar. CASTLE: regularization via auxiliary causal graph discovery. In Neur IPS, 2020. S ebastien Lachapelle, Philippe Brouillard, Tristan Deleu, and Simon Lacoste-Julien. Gradient-based neural DAG learning. In ICLR, 2020. Phillip Lippe, Taco Cohen, and Efstratios Gavves. Efficient neural causal discovery without acyclicity constraints. In ICLR, 2022. Po-Ling Loh and Peter B uhlmann. High-dimensional learning of linear causal networks via inverse covariance estimation. Journal of machine learning research, 15(1):3065 3105, 2014. Joris M. Mooij, Sara Magliacane, and Tom Claassen. Joint causal inference from multiple contexts. Journal of machine learning research, 21:99:1 99:108, 2020. Ignavier Ng, Zhuangyan Fang, Shengyu Zhu, and Zhitang Chen. Masked gradient-based causal structure learning. Co RR, abs/1910.08527, 2019a. Ignavier Ng, Shengyu Zhu, Zhitang Chen, and Zhuangyan Fang. A graph autoencoder approach to causal structure learning. Co RR, abs/1911.07420, 2019b. Ignavier Ng, Amir Emad Ghassami, and Kun Zhang. On the role of sparsity and DAG constraints for learning linear dags. In Neur IPS, 2020. Ignavier Ng, S ebastien Lachapelle, Nan Rosemary Ke, Simon Lacoste-Julien, and Kun Zhang. On the convergence of continuous constrained optimization for structure learning. In International Conference on Artificial Intelligence and Statistics, pp. 8176 8198, 2022. Judea Pearl. Causality: Models, Reasoning, and Inference. 2000. Published as a conference paper at ICLR 2023 Judea Pearl, Madelyn Glymour, and Nicholas P Jewell. Causal inference in statistics: A primer. John Wiley & Sons, 2016. Jonas Peters and Peter B uhlmann. Identifiability of gaussian structural equation models with equal error variances. Biometrika, 101(1):219 228, 2014. Jonas Peters and Peter B uhlmann. Structural intervention distance for evaluating causal graphs. Neural computation, 27(3):771 799, 2015. Jonas Peters, Joris M. Mooij, Dominik Janzing, and Bernhard Sch olkopf. Causal discovery with continuous additive noise models. Journal of machine learning research, 15(1):2009 2053, 2014. Jonas Peters, Peter B uhlmann, and Nicolai Meinshausen. 
Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):947 1012, 2016. Karen Sachs, Omar Perez, Dana Pe er, Douglas A Lauffenburger, and Garry P Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721): 523 529, 2005. Xinpeng Shen, Sisi Ma, Prashanthi Vemuri, and Gyorgy Simon. Challenges and opportunities with causal discovery algorithms: application to alzheimer s pathophysiology. Scientific reports, 10 (1):1 12, 2020. Shohei Shimizu, Patrik O. Hoyer, Aapo Hyv arinen, and Antti J. Kerminen. A linear non-gaussian acyclic model for causal discovery. Journal of machine learning research, 7:2003 2030, 2006. Abhinav Shrivastava, Abhinav Gupta, and Ross B. Girshick. Training region-based object detectors with online hard example mining. In CVPR, pp. 761 769. IEEE Computer Society, 2016. Tomi Silander and Petri Myllym aki. A simple approach for finding the globally optimal bayesian network structure. In UAI, 2006. Elena Sokolova, Perry Groot, Tom Claassen, and Tom Heskes. Causal discovery from databases with discrete and continuous variables. In European Workshop on Probabilistic Graphical Models, pp. 442 457, 2014. Peter Spirtes and Clark Glymour. An algorithm for fast recovery of sparse causal graphs. Social science computer review, 9(1):62 72, 1991. Peter Spirtes and Kun Zhang. Causal discovery and inference: concepts and recent methodological advances. In Applied informatics, volume 3, pp. 1 28, 2016. Peter Spirtes, Christopher Meek, and Thomas S. Richardson. Causal inference in the presence of latent variables and selection bias. In UAI, 1995. Ioannis Tsamardinos, Laura E Brown, and Constantin F Aliferis. The max-min hill-climbing bayesian network structure learning algorithm. Machine learning, 65(1):31 78, 2006. Sara Van de Geer and Peter B uhlmann. l0-penalized maximum likelihood for sparse directed acyclic graphs. The Annals of Statistics, 41(2):536 567, 2013. Gherardo Varando. Learning dags without imposing acyclicity. Co RR, abs/2006.03005, 2020. Matthew J Vowels, Necati Cihan Camgoz, and Richard Bowden. D ya like dags? a survey on structure learning and causal discovery. ACM Computing Surveys (CSUR), 2021. Yu Wang, An Zhang, Xiang Wang, Xiangnan He, and Tat-Seng Chua. Differentiable invariant causal discovery. Co RR, abs/2205.15638, 2022. Yuhao Wang, Vlado Menkovski, Hao Wang, Xin Du, and Mykola Pechenizkiy. Causal discovery from incomplete data: A deep learning approach. Co RR, abs/2001.05343, 2020. Dennis Wei, Tian Gao, and Yue Yu. Dags with no fears: A closer look at continuous optimization for learning bayesian networks. In Neur IPS, 2020. Published as a conference paper at ICLR 2023 Mengyue Yang, Furui Liu, Zhitang Chen, Xinwei Shen, Jianye Hao, and Jun Wang. Causalvae: Disentangled representation learning via neural structural causal models. In CVPR, 2021. Yue Yu, Jie Chen, Tian Gao, and Mo Yu. DAG-GNN: DAG structure learning with graph neural networks. In ICML, 2019. Yue Yu, Tian Gao, Naiyu Yin, and Qiang Ji. Dags with no curl: An efficient DAG structure learning approach. In ICML, 2021. Changhe Yuan and Brandon M. Malone. Learning optimal bayesian networks: A shortest path perspective. Journal of Artificial Intelligence Research, 48:23 65, 2013. Kun Zhang and Aapo Hyvarinen. On the identifiability of the post-nonlinear causal model. UAI, 2009. Xun Zheng, Bryon Aragam, Pradeep Ravikumar, and Eric P. Xing. 
Dags with NO TEARS: continuous optimization for structure learning. In Neur IPS, 2018. Xun Zheng, Chen Dan, Bryon Aragam, Pradeep Ravikumar, and Eric P. Xing. Learning sparse nonparametric dags. In AISTATS, 2020. Rong Zhu, Andreas Pfadler, Ziniu Wu, Yuxing Han, Xiaoke Yang, Feng Ye, Zhenping Qian, Jingren Zhou, and Bin Cui. Efficient and scalable structure learning for bayesian networks: Algorithms and applications. In ICDE, 2021. Shengyu Zhu, Ignavier Ng, and Zhitang Chen. Causal discovery with reinforcement learning. In ICLR, 2020. Published as a conference paper at ICLR 2023 A RELATED WORK Differentiable score-based causal discovery methods. Learning the directed acyclic graph (DAG) from purely observational data is challenging, owing mainly to the intractable combinatorial nature of acyclic graph space. A recent breakthrough, NOTEARS (Zheng et al., 2018), formulates the discrete DAG constraint into a continuous equality constraint, resulting in a differentiable score-based optimization problem. Recent subsequent works extends the formulation to deal with nonlinear problems by using a variety of deep learning models, such as neural networks (NOTEARS+ (Zheng et al., 2020), Gra N-DAG (Lachapelle et al., 2020), CASTLE (Kyono et al., 2020), MCSL (Ng et al., 2019a), DARING (He et al., 2021)), generative autoencoder (CGNN (Goudet et al., 2018), Causal VAE (Yang et al., 2021), ICL (Wang et al., 2020), DAG-GAN (Gao et al., 2021)), graph neural network (DAG-GNN (Gao et al., 2021), GAE (Ng et al., 2019b)), generative adversarial network (SAM (Kalainathan et al., 2018), ICL (Wang et al., 2020)), and reinforcement learning (RL-BIC (Zhu et al., 2020)). Multi-domain causal structure learning. Most multi-domain causal structure learning methods are constraint-based and have diverse definition of domains. In our paper, the multi-domain or multi-group refers to heterogeneous data whose underlying causal generating process remain stable but the distributions of noise variables may vary. In literature, our definition of multi-domain is consistent with MC (Ghassami et al., 2018), CD-NOD (Huang et al., 2020), LRE (Ghassami et al., 2017), DICD (Wang et al., 2022), and others (Peters et al., 2016). In addition to the strict restriction of knowing the domain annotation in advance, the majority of structure learning models dedicated to heterogeneous data exhibit limited applicability, due to linear case assumption (Ghassami et al., 2018; 2017), causal direction identification only (Huang et al., 2019; Cai et al., 2020), and timeconsuming (Huang et al., 2020). 
B ALGORITHM OF RESCORE Algorithm 1 Re Score Algorithm for Differentiable Score-based Causal Discovery Input: observational data D: {xi : i = 1, 2, ..., n}, DAG learner parameters 胃G, reweighting model parameters 胃w, cutoff threshold 蟿, epoch to start reweighting Kreweight, maximum epoch in the inner loop Kinner, maximum epoch in the outer loop Kouter Initialize: initialize 胃w to uniformly output 1 n, k1 = 0, k2 = 0 for k1 Kouter do Fix reweighting model parameters 胃w Calculate w by applying threshold [ 蟿 n蟿 ] Optimize 胃G by minimizing Sw (G; X) + PDAG(G) # Outer optimization in Equation 6 if k1 kreweight then for k2 Kinner do Fix the DAG learner s parameters 胃G Get w from 胃w by applying threshold [ 蟿 n蟿 ] Optimize 胃w by maximizing Sw(G; X) # Inner optimization in Equation 6 k2 k2 + 1 end for k1 k1 + 1 k2 0 end if end for return predicted G from DAG learner Published as a conference paper at ICLR 2023 C IN-DEPTH ANALYSIS OF RESCORE C.1 PROOF OF THEOREM 1 Theorem 1. Suppose the SEM in Equation 1 is linear and the size of observational data X is n. As the data size increases, i.e., n , Sw(G; X) + PDAG(G) arg min G S(G; X) + PDAG(G) a.s. 0 in the following cases: a. Using the least-squares loss L(G; X) = 1 2n X f(X) 2 F ; b. Using the negative log-likelihood loss with standard Gaussian noise. Proof. Let B = (尾1, . . . , 尾d) Rd d be the weighted adjacent matrix of a SEM, the linear SEM can be written in the matrix form: X = XB + N (7) where E(N|X) = 0 , Var(N|X) = diag(蟽2 1, . . . , 蟽2 d), and Bii = 0 since Xi cannot be the parent of itself. Let X Rn d be the observational data and N Rn d be the corresponding errors, then X = XB + N. The original and reweighted functions for optimization are S(B; X) + PDAG(B) = L(B; X) + 位Rsparse(B) + P(B), Sw(B; X) + PDAG(B) = Lw(B; X) + 位Rsparse(B) + PDAG(B). Comparing the above functions, only the first goodness-of-fit term are different, we will only consider this term. For the least-squares loss case, the optimization problem is min B Lw(B; X) = min B i=1 wil(xi, xi B), s.t. Bii = 0, i = 1, . . . , d. Let W = diag(w1, . . . , wn) be the n-dimensional matrix, and rewrite the loss function as i=1 wi xi xi B 2 2 i=1 wi(xi xi B)(xi xi B) j=1 wi(Xij xi尾j)2 j=1 (xj X尾j) W(xj X尾j), where xj is the j-th column in matrix X. Let Dj be the d-dimensional identify matrix by setting j-th element as 0, for j = 1, . . . , d. The above optimization is able to be written without the restriction: min B Lw(B; X) = min B j=1 (xj XDj尾j) W(xj XDj尾j) (xj) Wxj 2(xj) WXDj尾j + 尾 j D j X WXDj尾j . Published as a conference paper at ICLR 2023 The partial derivative of the loss function with respect to 尾j is 尾j = Pd j=1 (xj) WXj 2(xj) WXDj尾j + 尾 j D j X WXDj尾j = (xj) Wxj 2(xj) WXDj尾j + 尾 j D j X WXDj尾j 尾j = 2D j X Wxj + 2D j X WXDj尾j. Setting the partial derivative to zero produces the optimal parameter: 藛尾j = D j (X WX) 1Dj D j X Wxj = D j (X WX) 1Dj D j X W(XDj尾j + Nj) = Dj尾j + Dj(X WX) 1X WNj, (8) where Nj Rn is the j-th column in matrix N. In the above equation, the second equality holds because xj = XDj尾j + Nj. Similarly, one can easily obtain that the optimum parameter for ordinary mean-squared loss is 尾j = Dj尾j + Dj(X X) 1X Nj. (9) It is obvious that the difference between Equation 8 and Equation 9 is the second term. 
Compute the mean and variance matrix of the second term in Equation 8, we can get E (X WX) 1X WNj = E E (X WX) 1X WNj|X = E (X WX) 1X W E Nj|X Var (X WX) 1X WNj = E (X WX) 1X WNj(Nj) W X(X WX) 1 E (X WX) 1X WNj E (X WX) 1X WNj = E E (X WX) 1X WNj(Nj) WX(X WX) 1 X = E (X WX) 1X W E Nj(Nj) |X WX(X WX) 1 = E (X WX) 1X W E Nj(Nj) |X WX(X WX) 1 = 蟽2 j E (X WX) 1X W 2X(X WX) 1 . The last equality holds because E(NN |X) = Var(N|X) + E(N|X)[E(N|X)] = diag(蟽2 1, . . . , 蟽2 d). Since w C(蟿), it is easy to know that the variance matrix is finite. By the Kolmogorov s strong law of large numbers, the second term converges to zero, thus Published as a conference paper at ICLR 2023 which is same as the ordinary case. Since noise N = (N1, . . . , Nd) are jointly independent, the previous process can be apply to the other j {1, . . . , d}. Let 藛B = (藛尾1, . . . , 藛尾d) and B = ( 尾1, . . . , 尾d), then 藛B B a.s. 0. Therefore, the convergence has been shown for case a. Since the noise follows a Gaussian distribution, i.e. X XB = N = (N1, . . . , Nd) N 0 , diag(蟽2 1, . . . , 蟽2 d) , the loss function (negative log-likelihood function) is (Xij xi尾j)2 i=1 wi log 蟽j wi 2蟽2 j (Xij xi尾j)2 i=1 wi log 蟽j 1 2蟽2 j (xj X尾j) W(xj X尾j). (10) To minimize the loss function above w.r.t. B, it is equivalent to minimize the second term in Equation 10: min B Lw(B; X) min B 1 2蟽2 j (xj X尾j) W(xj X尾j). It can be seen that the RHS above is similar to the loss function in case a. except the coefficients 1 2蟽2 j , j = 1, . . . , d. Therefore, one can use same approaches to get the equivalence result for case b. Consequently, the proofs of the two special cases have been done. C.2 PROOF OF THEOREM 2 Theorem 2. Suppose that in the optimization phase, the i-th observation has a larger error than the j-th observation in the sense that l(xi, f(xi)) > l(xj, f(xj)), where i, j {1, . . . , n}. Then, where w i , w i are the optimal weights in Equation 6. The equality holds if and only if w i = w j = 蟿 n or w i = w j = 1 蟿n. Proof. We will show the theorem by contradiction. Without loss of generality, let i = 1, j = 2, and suppose w 1