# Reliable and Efficient Anytime Skeleton Learning

Rui Ding,1 Yanzhi Liu,2\* Jingjing Tian,3\* Zhouyu Fu,4\*\* Shi Han,1 Dongmei Zhang1
1Microsoft Research, 2Beijing University of Posts and Telecommunications, 3Peking University, 4Alibaba Group
1{juding, shihan, dongmeiz}@microsoft.com, 2lyztyj@bupt.edu.cn, 3tianjj97@pku.edu.cn, 4zhouyu.fz@alibaba-inc.com

\* The contributions by Yanzhi Liu and Jingjing Tian were made during their internships at Microsoft Research Asia.
\*\* The contributions by Zhouyu Fu were made during his employment at Microsoft Research Asia.

## Abstract

Skeleton Learning (SL) is the task of learning, from input data, an undirected graph that captures the dependency relations among variables. SL plays a pivotal role in causal learning and has attracted growing attention in the research community lately. Due to its high time complexity, anytime SL has emerged, which learns a skeleton incrementally and improves it over time. In this paper, we first propose and advocate the reliability requirement for anytime SL to be practically useful. Reliability requires the intermediately learned skeleton to have precision and persistency. We also present REAL, a novel Reliable and Efficient Anytime Learning algorithm for skeletons. Specifically, we point out that the commonly existing Functional Dependencies (FDs) among variables can make the learned skeleton violate the faithfulness assumption, and we propose a theory to resolve this incompatibility. Based on it, REAL conducts SL on a reduced set of variables with guaranteed correctness and thus drastically improves efficiency. Furthermore, it employs a novel edge-insertion and best-first strategy in an anytime fashion for skeleton growing, to achieve high reliability and efficiency. We prove that the skeleton learned by REAL converges to the correct skeleton under standard assumptions. Thorough experiments conducted on both benchmark and real-world datasets demonstrate that REAL significantly outperforms the other state-of-the-art algorithms.

## Introduction

Skeleton Learning (SL) plays a pivotal role in causal learning and knowledge discovery and has attracted growing attention in the research community lately (Shanmugam et al. 2015; Peters, Janzing and Schölkopf 2017; Bühlmann and Meinshausen 2016). It is the task of learning, from input data, an undirected graph that captures the variables' dependency relations. Under the causal sufficiency assumption (i.e., the set of observed variables $\mathcal{V}$ includes all common causes of pairs of variables in $\mathcal{V}$), each edge A–B in a skeleton has a causal semantic: there exists a direct cause-effect relationship between A and B, but the direction is unknown. Thus, for causal learning, SL is the primary step before performing interventions over directly linked variables in the skeleton to discover cause effects (Kocaoglu, Dimakis and Vishwanath 2017; Lindgren et al. 2018). In addition, SL has great potential in knowledge discovery (Chen et al. 2012); for example, an engineer may try to diagnose the causes of a service degradation and how its impact propagates. SL has a high computational complexity (Chickering 1996).

The cost of waiting for the final solution is often unacceptable for practical tasks that require decisions to be made within a given time budget, e.g., 30 minutes for investigating a cloud service degradation (Lou et al. 2017). On the other hand, a partially learned skeleton can also provide important clues for obtaining local causal relations (Pearl 2009) or for dependency analysis (Daly et al. 2006). This motivates the development of anytime SL techniques that incrementally learn a skeleton and improve it over time.

Two challenges need to be addressed to make anytime SL practically useful: reliability and efficiency. The reliability challenge entails two requirements, precision and persistency. Precision means that the partially learned skeleton $S(t)$ at any time $t$ should contain only trustable edges that correctly encode the dependencies between variables. Persistency means that the skeleton $S(t+1)$ updated at time $t+1$ should remove as few edges as possible from the previously learned skeleton $S(t)$. This enables users to reuse their analysis models obtained from the previous step over the newly added edges rather than start from scratch. To the best of our knowledge, the reliability issue has not been addressed in existing work.

Another major source of complication for SL stems from inter-variable Functional Dependency (Vardi 1987) in real-world data. For an FD (short for FD relationship) between two variables, the value of one variable is fully determined by the other. For instance, Country determines Continent, and Date determines Year. Using terms from the relational database domain, FDs typically exist between the primary key and other non-key variables within a table. Thus, FDs are prevalent in tabular datasets composed by joining multiple tables. We notice that certain FDs can make the learned skeleton violate the faithfulness assumption. Faithfulness (Pearl 2009) is an important assumption that has been widely adopted in many independence-based approaches for Bayesian Network Structure Learning (BNSL) (Spirtes et al. 2001; Tsamardinos, Brown and Aliferis 2006). To deal with this incompatibility, we suggest a weaker assumption called harmony (i.e., the data distribution should satisfy the Global Markov Property with respect to the learned graph G, and G should be minimal) instead of faithfulness for SL; a theory for harmonious SL must therefore be developed.

In this paper, we present REAL, a novel Reliable and Efficient Anytime skeleton Learning algorithm that achieves high reliability and high efficiency. Given FDs as input, we first provide a theory that uses the FDs to create an FD induced graph $G_{FD}$ and to identify a reduced set of variables for the subsequent skeleton learning algorithm. We prove that the skeleton learned in this way is guaranteed to be harmonious, and thus both reliability and efficiency are significantly improved. In contrast, existing SL algorithms do not take FDs into consideration, which is more time consuming and also prone to generating spurious edges in the learned skeleton. To further achieve high reliability, REAL exploits a bottom-up skeleton growing strategy for anytime SL. Specifically, REAL starts from an empty graph and incrementally adds edges into the skeleton through a robust local evaluation procedure called Robust PCFinder. Robust PCFinder aims to find candidate neighbors for each target variable T with few false positives.

Furthermore, due to a property of Robust PCFinder, the growth of local neighbors for each variable is independent of the order in which variables are prioritized. We thus propose a best-first search strategy that effectively utilizes the time budget by selecting the variable with the lowest overhead (i.e., time cost for local evaluation) first. When a timeout or user interrupt occurs, skeleton growing is stopped, and a lightweight post-processing step is conducted to further remove false positives.

In summary, we have made the following contributions.

1. To the best of our knowledge, we are the first to propose reliability (i.e., precision and persistency) as an essential requirement for anytime SL algorithms to be practically useful.
2. We provide a theory to resolve the incompatibility between FDs and faithfulness, which significantly improves the reliability and efficiency of the skeleton to be learned for datasets with FDs.
3. We propose a novel anytime skeleton growing algorithm based on an edge-insertion and best-first strategy to achieve high reliability and efficiency. We prove that the skeleton learned by REAL converges to the correct skeleton under standard assumptions.
4. We conduct thorough evaluations on benchmark and real-world datasets to demonstrate that REAL significantly outperforms the other state-of-the-art algorithms.

## Preliminaries

A dataset D consists of $N$ records and $d$ categorical columns, which represent $N$ instances drawn i.i.d. from $d$ discrete variables $\mathcal{V} = \{V_1, V_2, \ldots, V_d\}$ with joint probability distribution $P_\mathcal{V}$. Denote the cardinality of $V_i$ (i.e., the number of distinct values in the i-th column of D) as $c_i$, and let $C = \{c_1, c_2, \ldots, c_d\}$ be the cardinality set. To avoid degeneracy, we require $c_i > 1$ for all $i$.

**Markov Factorization Property.** Given a distribution $P_\mathcal{V}$ and a Directed Acyclic Graph (DAG) $G$, $P_\mathcal{V}$ is said to satisfy the Markov factorization property, or to be Markovian (with respect) to $G$, if $P_\mathcal{V} = P(V_1, V_2, \ldots, V_d) = \prod_{i=1}^{d} P(V_i \mid pa_i^G)$, where $pa_i^G$ is the parent set of $V_i$ in $G$. The Markov factorization property is the basis of graphical models; it encodes conditional independences in the distribution that we can exploit for learning graph structure.

**Global Markov Property (GMP).** A distribution $P_\mathcal{V}$ is said to satisfy the Global Markov Property, or GMP, (with respect) to a DAG G if $X \perp_G Y \mid Z \Rightarrow X \perp Y \mid Z$. Here $\perp_G$ denotes d-separation and $\perp$ denotes statistical independence. GMP states that any d-separation in graph G implies conditional independence in the distribution $P_\mathcal{V}$. GMP is equivalent to the Markov factorization property (Lauritzen 1996).

**Minimality.** A distribution $P_\mathcal{V}$ satisfies minimality (with respect) to a DAG G if it is Markovian to G but not Markovian to any proper subgraph of G. Minimality coincides with Occam's Razor principle (Pearl 2009): when a distribution $P_\mathcal{V}$ is Markovian to two graphs G and G', we prefer the simpler one to interpret the data. In fact, minimality is a necessary constraint for DAG learning. Without the minimality constraint, we could always construct a trivial graph $G'$ by setting $pa_i^{G'} = \{V_1, \ldots, V_{i-1}\}$, so that $P_\mathcal{V}$ is always Markovian to $G'$, because $P(V_1, V_2, \ldots, V_d) = \prod_{i=1}^{d} P(V_i \mid V_1, \ldots, V_{i-1}) = \prod_{i=1}^{d} P(V_i \mid pa_i^{G'})$.

**Faithfulness.** A distribution $P_\mathcal{V}$ is faithful (with respect) to a DAG G if $X \perp Y \mid Z \Rightarrow X \perp_G Y \mid Z$. Faithfulness implies minimality (Peters, Janzing and Schölkopf 2017). In practice, faithfulness is a strong assumption that can be violated in different ways. In our problem, faithfulness can be violated when the data contains FDs.

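Throughout the paper, the independence relation $\perp$ is estimated from data via conditional independence (CI) tests over discrete variables. As a purely illustrative aid (not the exact statistic used by REAL), the following minimal sketch implements one common choice, a G-test summed over the strata of the conditioning set; the function name, the significance level, and the handling of sparse strata are our own assumptions.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

def ci_test(df: pd.DataFrame, x: str, y: str, z: list, alpha: float = 0.05) -> bool:
    """Return True if x appears independent of y given the columns in z (G-test)."""
    g_stat, dof = 0.0, 0
    strata = [df] if not z else [grp for _, grp in df.groupby(z, observed=True)]
    for stratum in strata:
        table = pd.crosstab(stratum[x], stratum[y]).to_numpy().astype(float)
        table = table[table.sum(axis=1) > 0][:, table.sum(axis=0) > 0]  # drop empty rows/cols
        if table.shape[0] < 2 or table.shape[1] < 2:
            continue  # a constant stratum carries no evidence of dependence
        expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
        mask = table > 0
        g_stat += 2.0 * (table[mask] * np.log(table[mask] / expected[mask])).sum()
        dof += (table.shape[0] - 1) * (table.shape[1] - 1)
    if dof == 0:
        return True  # no usable stratum: independence is not rejected
    return chi2.sf(g_stat, dof) > alpha  # large p-value -> independence not rejected
```

For instance, `ci_test(df, 'Continent', 'Year', ['Country'])` would be expected to return True when Country functionally determines Continent, which is exactly the screening-off behavior exploited in the next section.
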
## Approach Overview

As depicted in Figure 1, the raw data is first fed into a module called FD Solver, which outputs a simplified FD graph $G'_{FD}$ and a subset of variables $\mathcal{V}'$ for the skeleton learning module. FD Solver first identifies FDs and then resolves the incompatibility between FDs and the faithfulness assumption by outputting $G'_{FD}$ and $\mathcal{V}'$; $\mathcal{V}'$ is guaranteed to contain no FDs. SL is then conducted on $\mathcal{V}'$ by incremental skeleton growing, which is designed in an anytime fashion: whenever time expires (either by user interruption or timeout), the partially learned skeleton is passed to the post-processing module. The post-processing module further removes false positives and returns the best-so-far skeleton to the user. FD Solver runs fast, with time complexity $O(Nd^2)$. The skeleton growing module is time consuming and is therefore designed in an incremental manner: starting from an empty graph, it iteratively selects a variable and tries to insert an edge on it. The variable selection is performed in a best-first way so that the time budget can be effectively utilized. Post-processing is lightweight. Below we illustrate the details of each part.

**Definition 1.** A functional dependency $X \rightarrow_{FD} Y$ means that the value of Y is determined by the value of X. The specific mapping is denoted as $Y = f(X)$, where Y is a discrete variable and X is a set of discrete variables.

FDs (Vardi 1987) typically exist between the primary key and other non-key variables within a table. Thus, FDs are prevalent in tabular datasets composed by joining multiple tables, and they reflect consistent or deterministic inter-variable relationships, such as Country $\rightarrow_{FD}$ Continent. In our problem, we restrict an FD $X \rightarrow_{FD} Y$ to the case where $|X| = 1$ (i.e., X is a single variable), because it is the most common FD relationship (the primary key is usually a single column) and detecting such FDs is lightweight. FDs with $|X| > 1$ are also useful for SL, but they are rarer and the time cost of detecting them is exponential in $|X|$, which is unsuitable for our problem.

Intuitively, given $X \rightarrow_{FD} Y$, X and Y naturally exhibit strong dependency. Furthermore, conditioning on X screens off Y from the other variables, since Y becomes single-valued. Thus FDs have an inherent connection to SL. These intuitions are captured by the following two lemmas. (Note: due to the page limit, all proofs of lemmas and theorems are available on our website: https://www.microsoft.com/en-us/research/project/real/.)

**Lemma 1.** If $X \rightarrow_{FD} Y$, then $Y \not\perp X$; and for any other variable set $Z$, $Z \perp Y \mid X$. Here $\not\perp$ denotes dependence (i.e., the two variables are not statistically independent).

**Lemma 2.** If $X \rightarrow_{FD} Y$ and $Z \perp X \mid W$, then $Z \perp Y \mid W$, where $Z$ and $W$ are disjoint sets of variables other than $X$ and $Y$.

FDs have the following properties:

- Transitivity: if $X \rightarrow_{FD} Y$ and $Y \rightarrow_{FD} Z$, then $X \rightarrow_{FD} Z$.
- Equivalence: if $X \rightarrow_{FD} Y$ and $Y \rightarrow_{FD} X$, then X is in one-to-one correspondence with Y, denoted $X \leftrightarrow Y$.

FD-equivalent variables are indistinguishable for skeleton learning tasks. If a dataset contains FD-equivalent groups, we keep only one variable from each group by pre-processing, and we assume no variables are FD-equivalent in the subsequent discussion.

**Definition 2 (FD induced graph).** Given a dataset D with n variables $X_1, \ldots, X_n$, the FD induced graph $G_{FD}$ is a directed graph over $X_1, \ldots, X_n$ obtained by assigning a directed edge from $X_i$ to $X_j$ whenever $X_i \rightarrow_{FD} X_j$ in D. $G_{FD}$ is a set of connected components, each of which is a DAG; this is ensured because FD-equivalent variables have been eliminated in the pre-processing stage.

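To make the FD Solver step concrete, here is a minimal sketch (our own illustrative code, not the authors' implementation) of how single-variable FDs of the restricted form $|X| = 1$ can be detected from a table and assembled into the FD induced graph of Definition 2; the function names and the pandas-based representation are assumptions.

```python
import pandas as pd
from collections import defaultdict

def find_single_variable_fds(df: pd.DataFrame) -> list:
    """All ordered pairs (X, Y) with X ->_FD Y: every value of X maps to one value of Y.
    One groupby per candidate determiner keeps the overall cost around O(N * d^2)."""
    fds = []
    for x in df.columns:
        grouped = df.groupby(x, observed=True)
        for y in df.columns:
            if y != x and (grouped[y].nunique() <= 1).all():
                fds.append((x, y))
    return fds

def fd_induced_graph(fds: list) -> dict:
    """Adjacency-list view of the FD induced graph G_FD (Definition 2)."""
    g_fd = defaultdict(set)
    for x, y in fds:
        g_fd[x].add(y)
    return dict(g_fd)
```

FD-equivalent pairs (X →FD Y together with Y →FD X) would show up here as two opposite edges; as stated above, one variable of each such pair is dropped in pre-processing before $G_{FD}$ is used.
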
### Violation of Faithfulness Induced by FDs

When a dataset contains FDs, the faithfulness assumption can be violated. Figure 2 shows a $G_{FD}$ over 3 variables X, Y and Z. According to Lemma 1, $P_\mathcal{V}$ implies two conditional independencies: $Z \perp Y \mid X$ and $Z \perp X \mid Y$. If the faithfulness assumption held, then $\{Z \perp Y \mid X;\ Z \perp X \mid Y\} \Rightarrow \{Z \perp_G Y \mid X;\ Z \perp_G X \mid Y\} \Rightarrow$ {no edge between Z and Y; no edge between Z and X} $\Rightarrow Z \perp_G Y \Rightarrow Z \perp Y$, which is a contradiction, because $Y \rightarrow_{FD} Z \Rightarrow Y \not\perp Z$ by Lemma 1. Therefore, if we adopt the faithfulness assumption when dealing with variables that have FDs, we may fail to obtain a DAG that is Markovian to $P_\mathcal{V}$.

*Figure 1. Overview of REAL's workflow: the FD Solver resolves the incompatibility and outputs the simplified FD graph $G'_{FD}$ and a subset of variables; best-first skeleton growing then produces incremental output until time expires; post-processing applies Pruning and the AND-Rule.*

*Figure 2. Violation of faithfulness by a simple FD structure ($X \rightarrow_{FD} Y \rightarrow_{FD} Z$).*

### Skeleton Learning with FDs

As illustrated above, the issue is that faithfulness can be invalid when the data contains FDs. On the other hand, GMP and minimality are weaker assumptions that do not suffer from this issue. We therefore define harmony:

**Definition 3 (harmonious).** A DAG G is said to be harmonious with respect to a distribution $P_\mathcal{V}$ if $P_\mathcal{V}$ satisfies both GMP and minimality with respect to G.

Now we formulate our graph structure learning problem as follows:

**FD-DAG Problem (DAG learning with FDs).** Given a dataset D, its corresponding distribution $P_\mathcal{V}$, and the FD induced graph $G_{FD}$ of D, and given that the faithfulness assumption is only valid for variables containing no FDs, find a DAG G such that G is harmonious to $P_\mathcal{V}$.

Solving the FD-DAG problem also solves the skeleton learning problem, since the skeleton is the undirected version of G. We now propose a theory to solve the FD-DAG problem.

### Solving the FD-DAG Problem

We analyze how, when the data contains FDs, edges of the learned DAG can be obtained without conducting learning. Our analysis starts from two typical structures of $G_{FD}$.

**Single-Root Structure.** If $G_{FD}$ has only one root variable (i.e., a variable with no parents), we say $G_{FD}$ is a single-root structure. An example is shown on the left-hand side of Figure 3, where X is the single root variable.

**Theorem 1.** If the $G_{FD}$ of variables $\{X, Z_1, \ldots, Z_n\}$ is a single-root structure with root X, then a harmonious DAG G is a single-root structure with root X in which all the other variables are directly linked from X and no other edges exist.

**Single-Sink Structure.** If the $G_{FD}$ of variables $\{X_1, \ldots, X_n, Z\}$ has only one sink variable Z (i.e., a variable with no children) and all the other variables $X_i$ are root variables linked to Z, we say $G_{FD}$ is a single-sink structure. An example is shown on the left-hand side of Figure 4, where Z is the single sink variable.

**Lemma 3.** Suppose $X_1 \rightarrow X_2$ in a DAG G and no other edges link $X_2$ with other variables. If G is Markovian to a distribution $P_\mathcal{V}$, then $X_2 \perp_G Y \mid U \Leftrightarrow X_1 \perp_G Y \mid U$ for any $Y \in \{X_3, \ldots, X_n\}$ and $U \subseteq \{X_3, \ldots, X_n\}$ with $Y \notin U$.

**Theorem 2.** If the $G_{FD}$ of variables $\{X_1, \ldots, X_n, Z\}$ is a single-sink structure with sink Z, then a harmonious DAG G is the union of two graphs $G_1$ and $G_2$, where $G_1$ is a DAG learned by any sound and complete DAG-learning algorithm over $\{X_1, \ldots, X_n\}$, and $G_2$ is $X_1 \rightarrow Z$. The right part of Figure 4 is an example of Theorem 2.

**General Structure.** The analysis of the single-root and single-sink structures suggests two ways to simplify the FD-DAG problem in the general case.

In $G_{FD}$, when a set of variables shares a common root X (i.e., a single-root structure), Theorem 1 suggests that these variables are directly linked from X in the learned DAG G. When a variable Z can be determined by multiple roots (i.e., a single-sink structure), Theorem 2 suggests that we can simply pick an arbitrary root $X_1$, link Z from $X_1$, and disconnect Z from all other variables in the learned DAG G. Below we show that these suggestions are indeed correct in general. We first define the simplified FD graph:

**Definition 4 (simplified FD graph).** In a $G_{FD}$ over $n + m$ variables, denote the root variables as $\{X_1, \ldots, X_n\}$ and the other variables as $\{Z_1, \ldots, Z_m\}$. A simplified FD graph $G'_{FD}$ is a subgraph of $G_{FD}$ in which each $Z_i$ preserves exactly one edge from one root that links to it and removes all its other edges. Figure 5 shows an example.

**Theorem 3.** In a $G_{FD}$ over variables $\{X_1, \ldots, X_n, Z_1, \ldots, Z_m\}$ with root variables $\{X_1, \ldots, X_n\}$ and simplified FD graph $G'_{FD}$, a harmonious DAG G is the union of two graphs $G_1$ and $G'_{FD}$, where $G_1$ is a DAG learned by any sound and complete DAG-learning algorithm over the root variables $\{X_1, \ldots, X_n\}$.

According to Theorem 2, when multiple roots share the same sink in $G_{FD}$, the learned harmonious DAG is not unique. Below we show that when a variable Z has only one parent X in $G_{FD}$, the edge between X and Z exists in every harmonious DAG.

**Theorem 4 (Uniqueness).** If a variable Z can only be functionally determined by another variable X, then the edge from X to Z exists in every harmonious DAG G.

The uniqueness property is convenient for experimental evaluation: we use these unique edges to check whether a skeleton learning algorithm can identify them.

### Variables Reduction

Theorem 3 shows that when the data contains FDs, we can run the subsequent SL algorithm on a subset of the variables without loss of correctness, thus reducing the computational cost.

*Figure 3. Example of a single-root structure: the FD graph with root X and children $Z_1, \ldots, Z_4$, and the learned DAG.*

*Figure 4. Example of a single-sink structure: the learned DAG is the union of $G_1$ learned over $\{X_1, \ldots, X_n\}$ and the edge $X_1 \rightarrow Z$.*

*Figure 5. Example of a general structure and its simplified FD graph: $G_{FD}$ (FD induced graph), $G'_{FD}$ (simplified FD graph, including an isolated root), and the learned DAG as the union of $G_1$ learned over $\{X_1, \ldots, X_n\}$ and $G'_{FD}$.*

**Corollary 1.** Given $G_{FD}$, the set of variables used for subsequent SL is the set of root variables in $G_{FD}$; these variables have no FDs with each other.

It is worth mentioning that FD-aided SL generates a more stable skeleton than the flat approach: CI-test-based or score-based DAG-learning algorithms cannot guarantee to find the correct FD edges (e.g., FD edges with the uniqueness property), whereas FDs typically reflect stable and consistent inter-variable knowledge.

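Continuing the illustrative sketch of the FD Solver above (again our own assumption rather than the paper's code), the variable reduction of Corollary 1 and the simplified FD graph of Definition 4 can be computed directly from the adjacency dict representing $G_{FD}$:

```python
def simplify_fd_graph(g_fd: dict, variables: list) -> tuple:
    """Root variables of G_FD plus a simplified FD graph G'_FD (Definition 4).

    g_fd maps each variable to the set of variables it functionally determines.
    Assumes FD-equivalent variables were already collapsed, so G_FD is acyclic.
    """
    children = {v: set(g_fd.get(v, ())) for v in variables}
    has_parent = {v: False for v in variables}
    for kids in children.values():
        for c in kids:
            has_parent[c] = True
    roots = [v for v in variables if not has_parent[v]]

    # By transitivity, every non-root has at least one root among its FD parents,
    # so keep exactly one root -> non-root edge per non-root (arbitrary choice, cf. Theorems 2-3).
    root_set = set(roots)
    simplified = {}
    for z in variables:
        if z not in root_set:
            determining_roots = [r for r in roots if z in children[r]]
            if determining_roots:
                simplified[z] = determining_roots[0]
    return roots, simplified
```

Subsequent skeleton learning then runs on `roots` only, and, per Theorem 3 (and line 7 of Algorithm 2 later), the final skeleton is the union of the skeleton learned over the roots and the undirected versions of the kept root-to-variable edges in `simplified`.
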
## Best-First Skeleton Learning

REAL's skeleton learning strategy works in a growing manner: it starts from an empty graph and incrementally inserts edges, and the newly added edges should contain few false positives. To accommodate these considerations, REAL maintains a variable set RawPC[T] that stores each variable T's raw PCs (i.e., its set of parents and children before post-processing). Once the PC growing for T is complete, RawPC[T] is pruned to obtain PrunedPC[T]. The output PC[T] is derived from PrunedPC[T] by applying the AND-Rule to appropriate variable pairs.

**Adding an edge.** Once a variable T is selected, RawPC[T] is grown by iteratively adding a new, most-likely PC candidate via the procedure Robust PCFinder (left part of Figure 6). A sub-procedure, Robust Corr, measures the likelihood that a variable V is a PC candidate of T (right part of Figure 6). If V and T are independent conditioned on some subset of the current RawPC[T], then V cannot be a PC of T (an implication of faithfulness), and V is discarded (lines 2~5 of Robust Corr); otherwise, Robust Corr returns the kth percentile of the correlations between V and T conditioned on all possible subsets of the current PC set as the likelihood measure (line 7). Robust PCFinder picks the variable with the highest likelihood and inserts it into RawPC[T]. In practice, the kth percentile is a robust statistic, more tolerant than Max/Mean to unstable CI tests when the sample size becomes small. In this way, the learned raw PC achieves high precision.

**Best-First Skeleton Growing.** Robust PCFinder only takes the variable T and its current RawPC[T] as input, so the variable it picks next is unaffected by the status of the PCs of other variables. In other words, how RawPC[T] grows is independent of the order in which variables are prioritized for calling Robust PCFinder. Note also that as RawPC[T] becomes larger, adding a new PC to T becomes more time consuming. Therefore, we can dynamically prioritize the variables (i.e., schedule the calls to Robust PCFinder) so that the easier tasks are executed earlier. In this way, given a limited time budget, we can output as many edges as possible by using a best-first skeleton growing strategy. Below we show how to prioritize variables.

**Lemma 4.** Denote by $c_{upper}$ the upper bound on the time complexity of one call $RobustPCFinder(T, RawPC[T])$. Then
$$c_{upper} = c_{RawPC[T]} \cdot c_T \cdot \sum_{j:\, V_j \neq T,\ V_j \notin RawPC[T]} c_j, \qquad \text{where } c_{RawPC[T]} = \prod_{l:\, V_l \in RawPC[T]} (c_l + 1).$$

Thus, the time complexity of Robust PCFinder mainly depends on the cardinalities of T and of its PC set. $c_{upper}$ can be computed or updated incrementally and quickly.

**Definition 5 (priority).** We define the priority of a variable as the reciprocal of its $c_{upper}$: $priority = 1 / c_{upper}$.

**Best-first strategy.** As shown in Algorithm 1 (left part of Figure 7), a priority queue $\mathbb{Q}$ is maintained, where each item is a ⟨variable, priority⟩ pair and the priority is calculated by Definition 5. Iteratively, REAL picks the variable with the highest priority, executes Robust PCFinder on it, and then updates its priority and puts it back into the queue (lines 2~6).

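As a small illustration of Lemma 4 and Definition 5 (the names and representation are our own; this is not the authors' code), the cost bound and the resulting priority can be computed directly from the cardinality set C:

```python
from math import prod

def c_upper(target: str, raw_pc: set, card: dict, variables: list) -> float:
    """Upper bound on the cost of one RobustPCFinder(target, raw_pc) call (Lemma 4)."""
    c_rawpc = prod(card[v] + 1 for v in raw_pc)  # weight of all conditioning subsets of raw_pc
    remaining = sum(card[v] for v in variables if v != target and v not in raw_pc)
    return c_rawpc * card[target] * remaining

def priority(target: str, raw_pc: set, card: dict, variables: list) -> float:
    """Priority of a variable: the reciprocal of its cost bound (Definition 5)."""
    return 1.0 / c_upper(target, raw_pc, card, variables)
```

Variables with small cardinality and few discovered neighbors therefore float to the front of the queue; a standard max-heap (e.g., Python's heapq with negated priorities) realizes the queue used by Algorithm 1.
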
    Procedure RobustPCFinder
    Input: T and its RawPC
    Output: a new variable most likely to be a PC of T
    1: V' ← V \ (RawPC ∪ {T})
    2: r* ← max_{V ∈ V'} RobustCorr(V, T, RawPC)
    3: if r* is null
    4:     return null
    5: V* ← argmax_{V ∈ V'} RobustCorr(V, T, RawPC)
    6: return ⟨V*, r*⟩

    Sub-procedure RobustCorr
    Input: T, its current RawPC, and V
    1: R ← ∅
    2: foreach S ⊆ RawPC do
    3:     r ← Corr(V, T | S)
    4:     if r < threshold
    5:         return null
    6:     R ← R ∪ {r}
    7: return kth percentile of R

*Figure 6. Procedure of Robust PCFinder.*

    Algorithm 1  Best First Growing
    Input: priority queue Q, variable set V
    1:  while Q is not empty and no interrupt do
    2:      T ← variable with highest priority in Q
    3:      V ← RobustPCFinder(T, RawPC[T])
    4:      if V is not null
    5:          RawPC[T] ← RawPC[T] ∪ {V}
    6:          Priority[T] ← Update(T, RawPC[T])
    7:      else
    8:          remove T from Q
    9:          PrunedPC[T] ← Prune(T, RawPC[T])
    10:         hasPruned[T] ← true
    11: end while
    12: foreach pair X, Y in V do
    13:     if hasPruned[X] is true and hasPruned[Y] is true
    14:         PC[X], PC[Y] ← ANDRule(X, Y)
    15: if interrupt occurred
    16:     foreach X in V do
    17:         if hasPruned[X] is false
    18:             PC[X] ← Prune(X, RawPC[X])
    19: return PC

    Procedure Prune
    Input: T, RawPC[T]
    1: PrunedPC[T] ← RawPC[T]
    2: foreach X in PrunedPC[T]
    3:     Other ← PrunedPC[T] \ {X}
    4:     if RobustCorr(X, T, Other) is null
    5:         PrunedPC[T] ← Other
    6: return PrunedPC[T]

    Procedure ANDRule
    Input: X, Y
    1: PC[X] ← PrunedPC[X], PC[Y] ← PrunedPC[Y]
    2: if X ∉ PrunedPC[Y]
    3:     PC[X] ← PrunedPC[X] \ {Y}
    4: if Y ∉ PrunedPC[X]
    5:     PC[Y] ← PrunedPC[Y] \ {X}
    6: return PC[X], PC[Y]

*Figure 7. Algorithm for Best-First Skeleton Learning.*

**Post Processing.** When a timeout occurs, skeleton growing is stopped and a relatively lightweight post-processing step is conducted to further remove false positives. The post-processing consists of two parts: AND-Rule checking for those variable pairs whose raw PCs have both been completely acquired (lines 12~14), and Prune for each remaining variable (lines 15~18). The procedure Prune (Figure 7) eliminates false positives in RawPC[T]: if any variable V in RawPC[T] is independent of T conditioned on a subset of RawPC[T] \ {V}, then V is not a PC of T. This is valid even when RawPC[T] has not been fully acquired due to a timeout. In summary, the variables eliminated by Prune are guaranteed to be false positives, so precision is further improved. Note that pruning is also conducted during skeleton growing once all the raw PCs of a specific variable have been acquired (lines 8~10), to reduce the workload of post-processing.

The AND-Rule is a standard rule applied by many independence-based approaches (Margaritis and Thrun 2000; Tsamardinos, Brown and Aliferis 2006). It is a symmetry constraint: if X has an edge to T in skeleton G, then X must be in the PC of T and T must be in the PC of X. The AND-Rule is only applicable when RawPC[X] and RawPC[T] have been completely acquired and pruning has been conducted (lines 12~14); otherwise, some true positives would be wrongly eliminated.

The post-processing is lightweight, so REAL responds to user interruption quickly. Although procedure Prune conducts CI tests over subsets of RawPC[T], many of these CI tests can be avoided by using a cache accumulated during the calls to Robust PCFinder. Moreover, the size of RawPC[T] is close to the size of the true PC set, thanks to the good performance of Robust PCFinder.

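For readers who prefer an executable form, the following is an illustrative Python rendering of Figure 6. It is a sketch under our own assumptions: a user-supplied `corr(v, t, s)` returning a dependence strength for V and T given conditioning set S, plus illustrative `threshold` and percentile `k` parameters; it is not the authors' implementation.

```python
from itertools import chain, combinations
import numpy as np

def subsets(items):
    """All subsets of `items`, including the empty set."""
    items = list(items)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

def robust_corr(corr, v, t, raw_pc, threshold=0.0, k=10):
    """Sub-procedure RobustCorr: kth-percentile dependence of v and t over all
    conditioning subsets of raw_pc; None if any subset renders them independent."""
    scores = []
    for s in subsets(raw_pc):
        r = corr(v, t, list(s))
        if r < threshold:              # independence found: v cannot be a PC of t
            return None
        scores.append(r)
    return np.percentile(scores, k)    # low percentile: robust alternative to max/mean

def robust_pc_finder(corr, t, raw_pc, variables, threshold=0.0, k=10):
    """Procedure RobustPCFinder: the candidate most likely to be a parent/child of t."""
    best_v, best_r = None, None
    for v in variables:
        if v == t or v in raw_pc:
            continue
        r = robust_corr(corr, v, t, raw_pc, threshold, k)
        if r is not None and (best_r is None or r > best_r):
            best_v, best_r = v, r
    return best_v                      # None signals that no further edge can be added to t
```

Algorithm 1 then simply pops the highest-priority variable T from the queue, calls `robust_pc_finder`, adds the returned candidate to RawPC[T] (or retires and prunes T when None is returned), and re-inserts T with its updated priority.
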
**Lemma 5 (soundness and completeness of skeleton learning).** Under the faithfulness, correct-CI-test, and infinite-data assumptions, given sufficient time and no user interruption, Best First Growing identifies the correct skeleton.

**REAL.** Algorithm 2 in Figure 8 shows the overall procedure of REAL. REAL first resolves the incompatibility between faithfulness and FDs (lines 1~3), and then Best First Growing is conducted on the subset of variables $\mathcal{V}'$ (lines 4~6). The final output skeleton is the union of the learned skeleton G and the skeleton obtained from the simplified FD graph $G'_{FD}$ (line 7).

    Algorithm 2  REAL
    Input: data D, variable set V, cardinalities C
    Output: skeleton G
    1: G_FD  ← obtain FD induced graph from D, V
    2: G'_FD ← obtain simplified FD graph from G_FD
    3: V'    ← get root nodes in G'_FD
    4: Q     ← construct priority queue from V', C
    5: G_0   ← empty graph
       /* skeleton growing allows timeout or user interrupt */
    6: G ← Best First Growing(G_0, Q)
    7: G ← G ∪ Skeleton(G'_FD)
    8: return G

*Figure 8. Algorithm of REAL.*

**Theorem 5 (consistency).** Given that the faithfulness assumption is only valid for variables with no FDs, and under the correct-CI-test and infinite-data assumptions, the skeleton learned by REAL converges to the correct skeleton.

## Experiments

We conduct evaluations comparing REAL with state-of-the-art SL algorithms on a) 19 large-scale benchmark datasets with ground-truth skeletons, and b) 3 real-world datasets with feedback from domain experts.

### Evaluation on Benchmark Datasets

**Dataset.** We use publicly available benchmark datasets from the Bayesian Network Repository (Scutari 2012) for evaluation. We focus on large (~100 variables) and very large (~1000 variables) networks, since they require and would benefit from anytime analysis due to their size; we thus choose 19 large-scale networks and sample 20,000 records from each network. Among these datasets, pathfinder and diabetes have unique FD edges (see Theorem 4 for the definition), so we use pathfinder and diabetes to evaluate the effectiveness and efficiency of our FD Solver.

**Comparison algorithms.** Due to the absence of a standalone anytime SL algorithm, we choose the three most relevant and competitive algorithms as our baselines: MINOBS (Lee and van Beek 2017), BLIP (Scanagatta et al. 2015) and PC (Spirtes et al. 2001). MINOBS and BLIP are state-of-the-art score-based algorithms for BNSL, while PC is an independence-based algorithm used to learn a skeleton for causal discovery. We only invoke the skeleton learning part of PC for comparison, denoted as PC*. MINOBS and BLIP do not learn a skeleton directly, so we use the skeleton of their resulting DAG for comparison. The configurations of MINOBS and BLIP are optimized for these datasets: all networks have max-in-degree 6 except win95pts, which has max-in-degree 7, so we set the max-in-degree threshold of BLIP and MINOBS to 6 (changing it to 7 would significantly degrade their efficiency).

**Implementation.** All experiments are conducted on a machine with a 3.2 GHz Intel i7-8700 processor and 16 GB RAM. We use the C++ implementation of MINOBS and the Java implementation of BLIP provided by their authors, both in their latest versions. We implemented PC* and REAL in C#. All code is executed in a single thread.

**Measures.** We evaluate the quality of the learned skeleton by precision and $F_{0.5}$ against the ground-truth skeleton. In the anytime setting, precision is more important than recall due to the reliability requirement, so we choose the $F_{0.5}$ measure, a commonly used measure that weighs precision higher than recall (Sasaki 2007). For an anytime SL algorithm, we record the $F_{0.5}$ and precision of its output under different time budgets.

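For reference, $F_{0.5}$ is the standard $F_\beta$ measure with $\beta = 0.5$, computed from the precision and recall of the learned edge set against the ground-truth skeleton:

$$F_\beta = \frac{(1+\beta^2)\cdot \text{precision}\cdot \text{recall}}{\beta^2\cdot \text{precision} + \text{recall}}, \qquad \beta = 0.5,$$

so that, loosely speaking, precision is weighted twice as heavily as recall.
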
**Results.** Figure 9 depicts the precision and $F_{0.5}$ of REAL, PC*, BLIP and MINOBS on the 12 largest benchmark datasets (results for all 19 datasets are available in the supplementary materials on our website due to the page limit). REAL significantly outperforms the other algorithms in both precision and $F_{0.5}$ on 10 datasets; the exceptions are hepar2 and diabetes, on which REAL's precision is still significantly higher than the others' while its $F_{0.5}$ is similar. Note that precision is more important than recall in the anytime scenario.

Figure 10 shows the evaluation of the FD Solver. Specifically, pathfinder contains 7 unique FD edges and diabetes contains 23. We first find that 100% of these edges exist in the ground-truth skeletons, which confirms the practical usefulness of the uniqueness theorem. In addition, compared with running REAL without the FD Solver (config = No FD), time is significantly saved while both precision and $F_{0.5}$ increase. All these results strongly support the effectiveness and efficiency of the FD theory. Note that MINOBS misses 5 of the 7 unique FD edges on pathfinder and 8 of the 23 on diabetes (last column of Figure 10); BLIP's result is identical to MINOBS's.

### Evaluation on Real-World Datasets

We also evaluate REAL on three real-world, complex datasets (Table 1), all obtained from domain experts and their regular analytical tasks. They have large numbers of records and high cardinalities. Table 1 also lists the number of unique FD edges for each dataset. Since these datasets have no ground-truth skeletons, we evaluate a) the effectiveness of finding unique FD edges, and b) an in-depth study of the learned skeleton based on the experts' feedback.

*Table 1. Information of the real-world datasets and results.*

| Dataset | #var | #records | $\max_i c_i$ | #unique FD edges | Missed by BLIP/MINOBS |
|---|---|---|---|---|---|
| School | 9 | 27,611 | 1486 | 3 | 1 |
| Bird Strikes | 50 | 52,962 | 3377 | 2 | 2 |
| Flight Delays | 20 | 1,048,575 | 1060 | 9 | 3 |

*Figure 9. Precision and $F_{0.5}$ of REAL, PC\*, BLIP and MINOBS on 12 large-scale benchmark datasets under increasing time budgets (the per-dataset numeric grids are omitted here). The panels include hepar2 (70/123), munin4 (1038/1388), munin3 (1041/1306), munin (1041/1397), munin1 (186/273), andes (223/338), diabetes (413/602), pigs (441/592), link (724/1125) and munin2 (1003/1244); a label such as 70/123 means the network has 70 nodes and 123 edges.*

*Figure 10. Effectiveness and efficiency of the FD theory: performance of REAL with (FD) and without (No FD) the FD Solver.*

| dataset | #unique FD edges | config | time (s) | Precision | Recall | $F_{0.5}$ | #edges missed by BLIP/MINOBS |
|---|---|---|---|---|---|---|---|
| pathfinder | 7 | FD | 176 | 0.971 | 0.518 | 0.827 | 5 |
| pathfinder | 7 | No FD | 208 | 0.970 | 0.503 | 0.818 | |
| diabetes | 23 | FD | 703 | 0.846 | 0.447 | 0.718 | 8 |
| diabetes | 23 | No FD | 760 | 0.834 | 0.409 | 0.690 | |

**Finding unique FD edges.** The right-most column of Table 1 shows the number of unique FD edges missed by BLIP and MINOBS. We omit the results of PC*, since it ran out of memory on Bird Strikes and Flight Delays. BLIP and MINOBS produce identical results on identifying unique FD edges: at least one third of the unique FD edges are missed in their results, which indicates that the learned DAG either violates the global Markov property or is not minimal. For example, in the School dataset there are two FDs among the three variables school_code, school_name and borough: school_code $\rightarrow_{FD}$ school_name and school_code $\rightarrow_{FD}$ borough. This makes sense, since a school code is a unique ID for a specific school, which is located in a specific borough and has a fixed name. However, different schools can share the same name, so these two FDs satisfy the uniqueness property, which implies a conditional independence: given school_code, school_name and borough are independent. BLIP and MINOBS do not find the edge from school_code to school_name, so their results are not Markovian.

**Skeleton of the Bird Strikes dataset.** The Bird Strikes dataset contains records of aircraft struck by birds. Besides the regular information of each record, such as date, time and location, there are two important indicator variables, "Effect" and "Damage Level". "Effect" records the action taken when the bird strike occurred, such as Aborted Take-off / Engine Shutdown / Precautionary Landing, and "Damage Level" records the level of the damage, such as High / Medium / Low / None. In addition, there are 14 Boolean variables indicating which specific part was struck by birds, such as "Strike Engine" / "Strike Propeller", and 14 Boolean variables indicating which specific part was damaged by the strike, such as "Damage Engine" / "Damage Propeller". According to the feedback from domain experts, "Effect" should be a direct effect of "Strike X": whenever a strike occurs, the pilot or tower controllers must immediately take action regardless of whether damage is incurred, and the action is recorded in the "Effect" column. Therefore, in the learned skeleton we would expect edges between "Effect" and the "Strike X" variables. REAL identifies 7 out of the 14 "Strike X" variables as neighbors of "Effect", while BLIP and MINOBS identify only 1. According to the experts' feedback, REAL's result is clearly more satisfactory.

## Related Work

The motivation for SL largely stems from the resurgence of interest in causality in recent years (Pearl 2009; Shanmugam et al. 2015; Peters, Janzing and Schölkopf 2017). As a standard approach to building causal graphs for reasoning, a skeleton is first constructed, and human interventions are then applied to determine the edge directions. Whereas many recent studies focus on how to minimize the cost of human intervention (Shanmugam et al. 2015; Kocaoglu, Dimakis and Vishwanath 2017; Lindgren et al. 2018), the problem of how to reliably and efficiently build the skeleton for intervention has been largely overlooked.

Independence-based BNSL methods generally use statistical tests to determine whether two variables are independent. These include PC (Spirtes et al. 2001), MMPC (Tsamardinos, Brown and Aliferis 2006), etc. Specifically, PC starts from a fully connected skeleton and iteratively removes edges, which violates reliability. The skeleton growing in MMPC is not conducted in a best-first way, and its procedure for inserting edges is not robust, so many false positives are included.

Score-based methods are another family of BNSL algorithms that can be used for SL. They fit a Bayesian network to the data by optimizing certain metrics, such as the posterior probability of a network given the data (Cooper and Herskovits 1992). These include GOBNILP (Cussens 2011), MINOBS (Lee and van Beek 2017), A* (Yuan, Malone and Wu 2011), BLIP (Scanagatta et al. 2015), etc. Since the metrics used by score-based methods are designed for BNSL and not tailored for SL, they may produce sub-optimal skeletons.

Anytime algorithms for BNSL have also been studied in the literature (Malone and Yuan 2013; Lee and van Beek 2017), and several anytime score-based methods for BNSL have been investigated in previous work (Campos and Ji 2011; Cussens 2011; Fan 2015; Lee and van Beek 2017; Jaakkola, Sontag, and Globerson 2010). However, these methods do not explicitly address reliability concerns for SL and may add spurious edges or delete true edges over the iterations. BLIP (Scanagatta et al. 2015) also supports an anytime mode, but it uses the entire allocated time budget and cannot be interrupted to return an intermediate skeleton during execution.

To deal with FD relationships in SL, Scheines et al. (1996) suggest ignoring the variables that can be functionally determined by others; as we pointed out, FDs exhibit strong dependency, which is useful for further tasks. Mabrouk et al. (2014) propose an efficient algorithm for BNSL when the data contains FDs; however, the algorithm assumes that all Z are parents of X in the learned BN whenever Z can determine X, so the learned BN is not harmonious, since minimality can be violated.

## Conclusions

In this paper, we present REAL, a novel Reliable and Efficient Anytime skeleton Learning algorithm. REAL first resolves the incompatibility between FDs and the faithfulness assumption. Based on this, REAL conducts SL on a reduced set of variables with guaranteed correctness, thus drastically improving efficiency. Furthermore, it employs a novel edge-insertion and best-first strategy in an anytime fashion for skeleton growing, to achieve high reliability and efficiency. Experimental results on benchmark and real-world datasets demonstrate that REAL significantly outperforms the other state-of-the-art algorithms.

## References

Chen, H.; Chiang, R. H.; and Storey, V. C. 2012. Business intelligence and analytics: From big data to big impact. MIS Quarterly, pp. 1165-1188.

Chickering, D. M. 1996. Learning Bayesian networks is NP-complete. In Learning from Data (pp. 121-130). Springer, New York, NY.

Lee, C., and van Beek, P. 2017, September. An experimental analysis of anytime algorithms for Bayesian network structure learning. In Advanced Methodologies for Bayesian Networks (pp. 69-80).

Cooper, G. F., and Herskovits, E. 1992. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9(4), pp. 309-347.

Cussens, J. 2012. Bayesian network learning with cutting planes. arXiv preprint arXiv:1202.3713.

Daly, R.; Shen, Q.; and Aitken, S. 2011. Learning Bayesian networks: approaches and issues. The Knowledge Engineering Review, 26(2), pp. 99-157.

Campos, C. P. D., and Ji, Q. 2011. Efficient structure learning of Bayesian networks using constraints. Journal of Machine Learning Research, 12(Mar), pp. 663-689.

Fan, X., and Yuan, C. 2015, March. An improved lower bound for Bayesian network structure learning. In Twenty-Ninth AAAI Conference on Artificial Intelligence.

Jaakkola, T.; Sontag, D.; Globerson, A.; and Meila, M. 2010, March. Learning Bayesian network structure using LP relaxations. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (pp. 358-365).

Lou, J. G.; Lin, Q.; Ding, R.; Fu, Q.; Zhang, D.; and Xie, T. 2017. Experience report on applying software analytics in incident management of online service. Automated Software Engineering, 24(4), pp. 905-941.

Kocaoglu, M.; Dimakis, A.; and Vishwanath, S. 2017, August. Cost-optimal learning of causal graphs. In Proceedings of the 34th International Conference on Machine Learning, Volume 70 (pp. 1875-1884). JMLR.org.

Lee, C., and van Beek, P. 2017, May. Metaheuristics for score-and-search Bayesian network structure learning. In Canadian Conference on Artificial Intelligence (pp. 129-141). Springer, Cham.

Lindgren, E.; Kocaoglu, M.; Dimakis, A. G.; and Vishwanath, S. 2018. Experimental design for cost-aware learning of causal graphs. In Advances in Neural Information Processing Systems (pp. 5279-5289).

Mabrouk, A.; Gonzales, C.; Jabet-Chevalier, K.; and Chojnacki, E. 2014, August. An efficient Bayesian network structure learning algorithm in the presence of deterministic relations. In ECAI (Vol. 14, pp. 567-572).

Malone, B., and Yuan, C. 2013. Evaluating anytime algorithms for learning optimal Bayesian networks. arXiv preprint arXiv:1309.6844.

Margaritis, D., and Thrun, S. 2000. Bayesian network induction via local neighborhoods. In Advances in Neural Information Processing Systems (pp. 505-511).

Pearl, J. 2009. Causality. Cambridge University Press.

Peters, J.; Janzing, D.; and Schölkopf, B. 2017. Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press.

Sasaki, Y. 2007. The truth of the F-measure. Teach Tutor Mater, 1(5), 1-5.

Scanagatta, M.; de Campos, C. P.; Corani, G.; and Zaffalon, M. 2015. Learning Bayesian networks with thousands of variables. In Advances in Neural Information Processing Systems (pp. 1864-1872).

Scheines, R.; Spirtes, P.; Glymour, C.; Meek, C.; and Richardson, T. 1996. TETRAD 3: Tools for Causal Modeling, User's Manual. CMU Philosophy.

Scutari, M. 2012. Bayesian Network Repository. http://www.bnlearn.com/bnrepository

Shanmugam, K.; Kocaoglu, M.; Dimakis, A. G.; and Vishwanath, S. 2015. Learning causal graphs with small interventions. In Advances in Neural Information Processing Systems (pp. 3195-3203).

Spirtes, P.; Glymour, C. N.; Scheines, R.; and Heckerman, D. 2000. Causation, Prediction, and Search. MIT Press.

Tsamardinos, I.; Brown, L. E.; and Aliferis, C. F. 2006. The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1), pp. 31-78.

Vardi, M. Y. 1985. Fundamentals of dependency theory. IBM Thomas J. Watson Research Division.

Yuan, C.; Malone, B.; and Wu, X. 2011, June. Learning optimal Bayesian networks using A* search. In Twenty-Second International Joint Conference on Artificial Intelligence.