# Tractable Uncertainty for Structure Learning

Benjie Wang (1), Matthew Wicker (1), Marta Kwiatkowska (1)

(1) Department of Computer Science, University of Oxford, Oxford, United Kingdom. Correspondence to: Benjie Wang.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

Bayesian structure learning allows one to capture uncertainty over the causal directed acyclic graph (DAG) responsible for generating given data. In this work, we present Tractable Uncertainty for STructure learning (TRUST), a framework for approximate posterior inference that relies on probabilistic circuits as the representation of our posterior belief. In contrast to sample-based posterior approximations, our representation can capture a much richer space of DAGs, while also being able to tractably reason about the uncertainty through a range of useful inference queries. We empirically show how probabilistic circuits can be used as an augmented representation for structure learning methods, leading to improvements in both the quality of inferred structures and posterior uncertainty. Experimental results on conditional query answering further demonstrate the practical utility of the representational capacity of TRUST.

1. Introduction

Understanding the causal and probabilistic relationships between variables of underlying data-generating processes can be a vital step in many scientific inquiries. Such systems are often represented by causal Bayesian networks (BNs), probabilistic models with structure expressed using a directed acyclic graph (DAG). The basic task of structure learning is to identify the underlying BN from a set of observational data, which, if successful, can provide useful insights about the relationships between random variables and the effects of potential interventions. However, even under strong assumptions such as causal sufficiency and faithfulness, it is typically impossible to identify a single causal DAG from purely observational data. Further, while consistent methods exist for producing a point-estimate DAG in the limit of infinite data (Chickering, 2002), in practice, when data is scarce, many BNs can fit the data well. It thus becomes vitally important to quantify the uncertainty over causal structures, particularly in safety-critical scenarios.

Bayesian methods for structure learning tackle this problem by defining a prior and likelihood over DAGs, such that the posterior distribution can be used to reason about the uncertainty surrounding the learned causal edges, for instance by performing Bayesian model averaging. Unfortunately, the super-exponential space of DAGs makes both representing and learning such a posterior extremely challenging. A major breakthrough was the introduction of order-based representations (Friedman & Koller, 2003), in which the state space is reduced to the space of topological orders. Unfortunately, the number of possible orders is still factorial in the dimension d, making it infeasible to represent the posterior as a tabular distribution over orders. Approximate Bayesian structure learning methods have thus mostly sought to approximate the distribution using samples of DAGs or orders (Lorch et al., 2021; Agrawal et al., 2018). However, such sample-based representations have very limited coverage of the posterior, restricting the information they can provide.
Consider, for instance, the problem of finding the most probable graph extension, given an arbitrary set of required edges. Given the super-exponential space, even a large sample may not contain a single order consistent with the given set of edges, making it impossible to answer such a query. A natural question, therefore, is whether it is possible to more compactly represent distributions over orders (and thus DAGs) while retaining the ability to perform useful inference queries tractably (in the size of the representation).

We answer in the affirmative by proposing a novel representation, Order SPNs, for distributions over orders and graphs. Under the assumption of order-modularity, we show that Order SPNs form a natural and flexible approximation to the target distribution. The key component is the encoding of hierarchical conditional independencies into the form of a sum-product network (SPN) (Poon & Domingos, 2011), a well-known type of tractable probabilistic circuit. Based on this, we develop an approximate Bayesian structure learning framework, TRUST, for efficiently querying Order SPNs and learning them from data. Empirical results corroborate the increased representational capacity and coverage of TRUST, while also demonstrating improved performance compared to competing methods on standard metrics.

Our contributions are as follows:

- We introduce a novel representation, Order SPNs, for Bayesian structure learning based on sum-product networks. In particular, we exploit exact hierarchical conditional independencies present in order-modular distributions. This allows Order SPNs to express distributions over a potentially exponentially larger set of orders relative to their size.
- We show that Order SPNs satisfy desirable properties that enable tractable and exact inference. In particular, we present methods for computing a range of useful inference queries in the context of structure learning, including marginal and conditional edge probabilities, graph sampling, maximal-probability graph completions, and pairwise causal effects. We further provide complexity results for these queries; notably, all take at most linear time in the size of the circuit.
- We demonstrate how our method, TRUST, can be used to approximately learn a posterior over DAG structures given observational data. In particular, we utilize a two-step procedure, in which we (i) propose a structure for the SPN using a seed sampler; and (ii) optimize the parameters of the SPN in a variational inference scheme. Crucially, the tractable properties of the circuit enable the ELBO and its gradients to be computed exactly, without sampling.

2. Related Work

Bayesian approaches to structure learning infer a distribution over possible causal graphs. Such distributions can then be queried to extract useful information, such as estimates of causal effects, which can aid investigators in understanding the domain or in planning interventions (Castelletti & Consonni, 2021; Maathuis et al., 2010; Viinikka et al., 2020). Unfortunately, due to the super-exponential space, exact Bayesian inference methods for structure learning do not scale beyond d = 20 (Koivisto & Sood, 2004; Koivisto, 2006). As a result, there has been much interest in approximate methods, most notably MCMC sampling over the space of DAGs (Madigan et al., 1995; Giudici & Castelo, 2003).
Notable works in this direction include those of Friedman & Koller (2003), who operate over the much smaller space and smoother posterior landscape of topological orders, Tsamardinos et al. (2006), who reduce the state space by considering conditional independence, and Kuipers et al. (2018), who reduce the per-step computational cost associated with scoring. Alternatively, some recent works have applied variational inference to the Bayesian structure learning problem, where an approximate distribution over graphs is obtained by optimizing over some variational family of distributions over graphs. Unfortunately, existing representations are typically not very tractable: Annadani et al. (2021) and Cundy et al. (2021) utilize neural autoregressive and energy-based models respectively, while Lorch et al. (2021) employ sample-based approximations and particle variational inference (Liu & Wang, 2016). This presents significant challenges for gradient-based optimization, since the reparameterization trick is not applicable to the discrete space of graphs. Further, downstream inference queries can only be estimated approximately through sampling. In contrast, our proposed variational family based on tractable models makes optimization and inference exact and efficient.

Probabilistic circuits (Choi et al., 2020) are a general class of tractable probabilistic models which represent distributions using computational graphs. The key advantage of circuits, compared to other probabilistic models such as Bayesian networks, VAEs (Kingma & Welling, 2013), or GANs (Goodfellow et al., 2014), is their ability to perform tractable and exact inference, for instance, computing marginal probabilities. While the typical use case is to learn a distribution over a set of variables from data (Gens & Domingos, 2013; Rooshenas & Lowd, 2014), in this work we consider learning a circuit to approximate a given (intractable) posterior distribution over the space of DAGs, thus requiring different structure and parameter learning routines.

3. Background

3.1. Bayesian Structure Learning

Bayesian Networks. A Bayesian network (BN) N = (G, Θ) is a probabilistic model p(X) over d variables X = {X_1, ..., X_d}, specified using a directed acyclic graph (DAG) G, which encodes conditional independencies in the distribution p, and parameters Θ, which parameterize the mechanisms (conditional probability distributions) constituting the Bayesian network. The conditional probabilities take the form p(X_i | pa_G(X_i), Θ_i), giving rise to the joint data distribution

$$p(X \mid G, \Theta) = \prod_i p(X_i \mid \mathrm{pa}_G(X_i), \Theta_i),$$

where pa_G(X) denotes the parents of X in G.

One of the most popular BN models is the linear Gaussian model, under which the distribution is given by the structural equation X = XB + ε, where B ∈ R^{d×d} is a matrix of real weights parameterizing the mechanisms, and ε ~ N(b, Σ), with b ∈ R^d and Σ ∈ R^{d×d}_{≥0} a diagonal matrix of noise variances. In particular, for a given DAG G, we have B_ij = 0 for all i, j such that i is not a parent of j in G.
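To make the structural equation concrete, the following is a minimal sketch (ours, not from the paper) of ancestral sampling from a linear Gaussian BN given a weight matrix B whose sparsity pattern respects a DAG G; the helper `topological_order` and the example chain graph are illustrative assumptions.

```python
import numpy as np

def topological_order(B):
    """Topological order of the DAG implied by the nonzeros of B (parents -> children)."""
    d = B.shape[0]
    remaining, order = set(range(d)), []
    while remaining:
        # A node with no unprocessed parents can be sampled next.
        j = next(v for v in remaining if all(B[p, v] == 0 for p in remaining if p != v))
        order.append(j)
        remaining.remove(j)
    return order

def sample_linear_gaussian(B, b, noise_var, n_samples, rng=None):
    """Ancestral sampling from X = X B + eps, eps ~ N(b, diag(noise_var)).

    B[i, j] != 0 only if i is a parent of j in G, so sampling variables
    in a topological order of G is well-defined.
    """
    rng = np.random.default_rng(rng)
    d = B.shape[0]
    X = np.zeros((n_samples, d))
    for j in topological_order(B):
        eps = rng.normal(b[j], np.sqrt(noise_var[j]), size=n_samples)
        X[:, j] = X @ B[:, j] + eps  # parents of j are already filled in
    return X

# Example (illustrative): chain 1 -> 2 -> 3 with unit weights and standard normal noise.
B = np.array([[0., 1., 0.], [0., 0., 1.], [0., 0., 0.]])
data = sample_linear_gaussian(B, b=np.zeros(3), noise_var=np.ones(3), n_samples=1000)
```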
Whereas Bayesian networks typically only express probabilistic (conditional independence) information, causal Bayesian networks (Spirtes et al., 2000; Pearl, 2009) are additionally imbued with a causal interpretation, where, intuitively, the directed edges in G represent direct causation. More formally, causal BNs can predict the effect (the change in joint distribution) of interventions on the system, in which some mechanism is changed, for instance by setting a variable X to some value x independently of its parents.

Bayesian Structure Learning. Structure learning (Koller & Friedman, 2009; Glymour et al., 2019) is the problem of learning the DAG G of the (causal) Bayesian network responsible for generating some given data D. Typically, strong assumptions are required for structure learning; in this work, we make the common assumption of causal sufficiency, meaning that there are no latent (unobserved) confounders. Even given this assumption, it is often not possible to reliably infer the causal DAG, whether due to limited data or to non-identifiability within a Markov equivalence class. Instead of learning a single DAG, Bayesian approaches to structure learning express uncertainty over structures in a unified fashion by defining a prior p_pr(G) and (marginal) likelihood p_lh(D|G) over directed graphs G. A common assumption is that the prior and likelihood scores are modular, that is, they decompose into a product of terms for each mechanism G_i of the graph, where G_i specifies the set of parents of variable i in G. In such cases, the overall posterior decomposes as

$$p_G(G \mid D) \propto \mathbb{1}_{\mathrm{DAG}}(G)\, p_{pr}(G)\, p_{lh}(D \mid G) = \mathbb{1}_{\mathrm{DAG}}(G) \prod_i p_{pr}(G_i)\, p_{lh}(D_i \mid G_i).$$

The acyclicity constraint 1_DAG(G) induces correlations between the different mechanisms and presents the key computational challenge for posterior inference. The prior and likelihood can be chosen based on knowledge about the domain; for example, for linear Gaussian models, we can employ the BGe score (Kuipers et al., 2014), a closed-form expression for the marginal likelihood of a variable given its parent set (marginalizing over the weights of the linear model). The prior is typically chosen to penalize larger parent sets.

3.2. Sum-Product Networks

Sum-product networks (SPNs) are probabilistic circuits over a set of variables V, represented using a rooted DAG consisting of three types of nodes: leaf, sum, and product nodes. Each node can be viewed as representing a distribution over some subset of variables W ⊆ V, with the root node specifying an overall distribution q_φ(V). Each leaf node L specifies an input distribution over some subset of variables W ⊆ V, which is assumed to be tractable. Each product node P multiplies the distributions given by its children, i.e., P = ∏_{C_i ∈ ch(P)} C_i, while each sum node T is a weighted sum of its children, i.e., T = Σ_{C_i ∈ ch(T)} φ_i C_i. The weights φ_i of each sum node satisfy φ_i > 0 and Σ_i φ_i = 1, and are referred to as the parameters of the SPN. The scope of a node N is the set of variables N specifies a distribution over, and can be defined recursively as follows: each leaf node N has scope sc(N) = {V}, where V is the variable it specifies its distribution over, and each product or sum node N has scope sc(N) = ∪_{C ∈ ch(N)} sc(C).

SPNs provide a computationally convenient representation of probability distributions, enabling efficient and exact inference for many types of queries, given certain structural properties (Poon & Domingos, 2011; Peharz et al., 2014):

- An SPN is complete if, for every sum node T and any two children C_1, C_2 of T, it holds that sc(C_1) = sc(C_2). In other words, all the children of T, and thus T itself, have the same scope.
- An SPN is decomposable if, for every product node P and any two children C_1, C_2 of P, it holds that sc(C_1) ∩ sc(C_2) = ∅. In other words, the scope of P is partitioned by its children.
- An SPN is deterministic if, for every sum node T and any instantiation w of its scope sc(T) = W, at most one of its children C_i evaluates to a non-zero probability at w.
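To make these definitions concrete, here is a minimal sketch (ours, not the paper's implementation; the class names and numeric weights are illustrative) of a complete and decomposable SPN over two binary variables, evaluated bottom-up. Unobserved leaf variables return 1, previewing the marginal computation discussed next.

```python
# Minimal SPN sketch (illustrative): Leaf over one variable, Product over disjoint
# scopes, Sum with normalized weights over children that share a scope.

class Leaf:
    def __init__(self, var, probs):          # probs: {value: probability}
        self.var, self.probs = var, probs
        self.scope = {var}

    def value(self, assignment):
        # Marginalize the leaf's variable by returning 1 if it is unobserved.
        if self.var not in assignment:
            return 1.0
        return self.probs[assignment[self.var]]

class Product:
    def __init__(self, children):
        self.children = children
        self.scope = set().union(*(c.scope for c in children))  # decomposability: disjoint child scopes

    def value(self, assignment):
        out = 1.0
        for c in self.children:
            out *= c.value(assignment)
        return out

class Sum:
    def __init__(self, weighted_children):    # list of (weight, child); weights sum to 1
        self.weighted_children = weighted_children
        self.scope = weighted_children[0][1].scope  # completeness: all children share one scope

    def value(self, assignment):
        return sum(w * c.value(assignment) for w, c in self.weighted_children)

# A complete and decomposable SPN over binary variables X1, X2.
spn = Sum([
    (0.6, Product([Leaf("X1", {0: 0.8, 1: 0.2}), Leaf("X2", {0: 0.3, 1: 0.7})])),
    (0.4, Product([Leaf("X1", {0: 0.1, 1: 0.9}), Leaf("X2", {0: 0.5, 1: 0.5})])),
])

print(spn.value({"X1": 1, "X2": 0}))  # joint probability q(X1=1, X2=0)
print(spn.value({"X1": 1}))           # marginal q(X1=1), computed in one bottom-up pass
```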
Given completeness and decomposability, marginal inference becomes tractable: we can compute q_φ(W) for any W ⊆ V in time linear in the number of edges of the SPN. Conditional probabilities can be computed as the ratio of two marginal probabilities. If the SPN additionally satisfies determinism, MPE inference, i.e., max_{v: v_W = w} q_φ(v), also becomes tractable (Peharz et al., 2017).

4. Tractable Representations for Bayesian Structure Learning

In this work, we consider Bayesian structure learning over the joint space of topological orders and DAGs, where each order σ is a permutation of {1, ..., d}. For a subset S ⊆ {1, ..., d}, we write σ_S for the restriction of σ to S and G_S for the corresponding parent sets (G_i)_{i ∈ S}.

Each sum node T of an Order SPN is associated with a pair of disjoint subsets (S1, S2) of {1, ..., d} with |S2| > 1, and has scope sc(T) = (σ_S2, G_S2). It has K_T children with weights φ_{T,i} for i = 1, ..., K_T, where the ith child is a product node P associated with (S1, S21,i, S22,i) for some partition (S21,i, S22,i) of S2. Each product node P is associated with three disjoint subsets (S1, S21, S22) of {1, ..., d}, and has scope sc(P) = (σ_{S21∪S22}, G_{S21∪S22}), where σ_{S21∪S22} takes the form (σ_S21, σ_S22). It has two children, where the first child is associated with (S1, S21) and the second with (S1 ∪ S21, S22); these children are either sum nodes or leaves.

Figure 1. Example of a regular Order SPN for d = 4. (a) A regular Order SPN with expansion factors K = (3, 2); each sum/leaf node is labelled with its associated (S1, S2), and only one expansion beyond the first level is shown for clarity. (b) Example orders and graphs for three sum/leaf nodes; the graphs only include the parent sets of the S2 variables.

We can interpret each sum (or leaf) node T associated with (S1, S2) as representing a distribution over DAGs over the variables S2, where these variables can additionally have parents from among S1. In other words, every sum node represents a (smaller) Bayesian structure learning problem over a set of variables S2 and a set of potential confounders S1.

In practice, we organize the SPN into alternating layers of sum and product nodes, starting with the root sum node. In the jth sum layer, we create a fixed number K_j of children for each sum node T in the layer. Further, for each child i of each sum node T, we choose (S21,i, S22,i) such that |S21,i| = ⌈|S2|/2⌉ and |S22,i| = ⌊|S2|/2⌋, and we require that the partitions are distinct for different children i of T. Under these conditions, the Order SPN has ⌈log_2(d)⌉ sum (and product) layers. This ensures compactness of the representation and enables efficient tensorized computation over layers. We call such Order SPNs regular, and the associated list K of numbers of children per layer the expansion factors. An example of a regular Order SPN is shown in Figure 1: at the top sum layer, we create a child for each of K_1 = 3 different partitions of {1, 2, 3, 4} into equally sized subsets, each with an associated weight, and sum and product layers alternate until we reach the leaf nodes.
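As a rough illustration of this layered construction, the sketch below (ours, not the paper's code) recursively expands a node associated with (S1, S2) into K distinct near-half partitions of S2, mirroring the alternation of sum and product layers in a regular Order SPN. The random choice of partitions is a stand-in for the paper's seed-sampler-based structure proposal, and the dictionary representation is purely illustrative.

```python
import itertools
import random

def expand(S1, S2, expansion_factors, depth=0):
    """Recursively build the (S1, S2) associations of a regular Order SPN.

    Returns a nested dict mirroring the sum/product alternation; leaves are
    nodes whose S2 contains a single variable. Partitions are chosen at
    random here for illustration only.
    """
    if len(S2) == 1:
        return {"type": "leaf", "S1": S1, "S2": S2}

    K = expansion_factors[depth]
    half = (len(S2) + 1) // 2  # |S21| = ceil(|S2| / 2)
    # All ways of choosing S21 of that size from S2; take K distinct ones.
    all_S21 = list(itertools.combinations(sorted(S2), half))
    chosen = random.sample(all_S21, min(K, len(all_S21)))

    children = []
    for S21 in chosen:
        S21, S22 = set(S21), set(S2) - set(S21)
        children.append({
            "type": "product",
            # First child: variables S21 with potential confounders S1;
            # second child: variables S22 with potential confounders S1 ∪ S21.
            "children": [
                expand(S1, S21, expansion_factors, depth + 1),
                expand(S1 | S21, S22, expansion_factors, depth + 1),
            ],
        })
    return {"type": "sum", "S1": S1, "S2": S2, "children": children}

# Structure of the d = 4 example with expansion factors K = (3, 2).
structure = expand(set(), {1, 2, 3, 4}, expansion_factors=(3, 2))
```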
The leaf nodes L represent distributions over a single column of the graph: if L is associated with (S1, i), then it expresses a distribution over the parent set G_i of variable i. The interpretation of S1 is that this distribution should only have support over sets G_i ⊆ S1. This restriction ensures that Order SPNs are consistent, in the sense that they represent distributions over valid (σ, G) pairs (in particular, all graphs are acyclic):

Proposition 2. Let q_φ be an Order SPN. Then, for all pairs (σ, G) in the support of q_φ, it holds that G |= σ.

By design, (regular) Order SPNs satisfy the standard SPN properties that make them an efficient representation for inference, as we show in the following proposition. In the following sections, we use these properties for query computation, as well as for learning the SPN parameters.

Proposition 3. Any Order SPN is complete and decomposable, and regular Order SPNs are additionally deterministic.

4.3. Leaf Distributions

Given a leaf node associated with (S1, i), corresponding to a distribution on G_i, the only restriction imposed by the definition is that the distribution has support only on parent sets G_i ⊆ S1. Given that we are approximating an order-modular distribution p(σ, G) ∝ ∏_i p_{G_i}(G_i) 1_{G_i ⊆ σ…
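As a sketch of one possible leaf parameterization (ours, not the paper's; `local_score` is a placeholder for a modular log-score such as BGe, and the cap on parent-set size is an added assumption for tractability), a leaf associated with (S1, i) can enumerate candidate parent sets G_i ⊆ S1 and normalize their exponentiated scores.

```python
import itertools
import math

def leaf_distribution(i, S1, local_score, max_parents=3):
    """Distribution over parent sets G_i ⊆ S1 for variable i.

    local_score(i, parent_set) should return a modular log-score, e.g.
    log p_pr(G_i) + log p_lh(D_i | G_i). Enumeration is capped at
    max_parents parents for tractability (an assumption of this sketch).
    """
    candidates = [
        frozenset(ps)
        for k in range(min(len(S1), max_parents) + 1)
        for ps in itertools.combinations(sorted(S1), k)
    ]
    log_weights = {ps: local_score(i, ps) for ps in candidates}
    # Normalize with a log-sum-exp shift for numerical stability.
    shift = max(log_weights.values())
    z = sum(math.exp(lw - shift) for lw in log_weights.values())
    return {ps: math.exp(lw - shift) / z for ps, lw in log_weights.items()}

# Toy local score favouring small parent sets (purely illustrative).
toy_score = lambda i, ps: -0.5 * len(ps)
dist = leaf_distribution(i=4, S1={1, 2}, local_score=toy_score)
# dist maps each candidate parent set of variable 4 (subsets of {1, 2}) to its probability.
```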