# Tractable Uncertainty for Structure Learning

Benjie Wang (1), Matthew Wicker (1), Marta Kwiatkowska (1)

(1) Department of Computer Science, University of Oxford, Oxford, United Kingdom. Correspondence to: Benjie Wang.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

Bayesian structure learning allows one to capture uncertainty over the causal directed acyclic graph (DAG) responsible for generating given data. In this work, we present Tractable Uncertainty for STructure learning (TRUST), a framework for approximate posterior inference that relies on probabilistic circuits as the representation of our posterior belief. In contrast to sample-based posterior approximations, our representation can capture a much richer space of DAGs, while also being able to tractably reason about the uncertainty through a range of useful inference queries. We empirically show how probabilistic circuits can be used as an augmented representation for structure learning methods, leading to improvements in both the quality of inferred structures and posterior uncertainty. Experimental results on conditional query answering further demonstrate the practical utility of the representational capacity of TRUST.

1. Introduction

Understanding the causal and probabilistic relationships between variables of underlying data-generating processes can be a vital step in many scientific inquiries. Such systems are often represented by causal Bayesian networks (BNs), probabilistic models with structure expressed using a directed acyclic graph (DAG). The basic task of structure learning is to identify the underlying BN from a set of observational data, which, if successful, can provide useful insights about the relationships between random variables and the effects of potential interventions. However, even under strong assumptions such as causal sufficiency and faithfulness, it is typically impossible to identify a single causal DAG from purely observational data. Further, while consistent methods exist for producing a point-estimate DAG in the limit of infinite data (Chickering, 2002), in practice, when data is scarce, many BNs can fit the data well. It thus becomes vitally important to quantify the uncertainty over causal structures, particularly in safety-critical scenarios.

Bayesian methods for structure learning tackle this problem by defining a prior and likelihood over DAGs, such that the posterior distribution can be used to reason about the uncertainty surrounding the learned causal edges, for instance by performing Bayesian model averaging. Unfortunately, the super-exponential space of DAGs makes both representing and learning such a posterior extremely challenging. A major breakthrough was the introduction of order-based representations (Friedman & Koller, 2003), in which the state space is reduced to the space of topological orders. Unfortunately, the number of possible orders is still factorial in the dimension d, making it infeasible to represent the posterior as a tabular distribution over orders. Approximate Bayesian structure learning methods have thus mostly sought to approximate the distribution using samples of DAGs or orders (Lorch et al., 2021; Agrawal et al., 2018). However, such sample-based representations have very limited coverage of the posterior, restricting the information they can provide.
Consider, for instance, the problem of finding the most probable graph extension, given an arbitrary set of required edges. Given the super-exponential space, even a large sample may not contain a single order consistent with the given set of edges, making it impossible to answer such a query. A natural question, therefore, is whether it is possible to more compactly represent distributions over orders (and thus DAGs) while retaining the ability to perform useful inference queries tractably (in the size of the representation).

We answer in the affirmative by proposing a novel representation, Order SPNs, for distributions over orders and graphs. Under the assumption of order-modularity, we show that Order SPNs form a natural and flexible approximation to the target distribution. The key component is the encoding of hierarchical conditional independencies into the form of a sum-product network (SPN) (Poon & Domingos, 2011), a well-known type of tractable probabilistic circuit. Based on this, we develop an approximate Bayesian structure learning framework, TRUST, for efficiently querying Order SPNs and learning them from data. Empirical results corroborate the increased representational capacity and coverage of TRUST, while also demonstrating improved performance compared to competing methods on standard metrics.

Our contributions are as follows:

- We introduce a novel representation, Order SPNs, for Bayesian structure learning based on sum-product networks. In particular, we exploit exact hierarchical conditional independencies present in order-modular distributions. This allows Order SPNs to express distributions over a potentially exponentially larger set of orders relative to their size.
- We show that Order SPNs satisfy desirable properties that enable tractable and exact inference. In particular, we present methods for computing a range of useful inference queries in the context of structure learning, including marginal and conditional edge probabilities, graph sampling, maximal-probability graph completions, and pairwise causal effects. We further provide complexity results for these queries; notably, all take at most linear time in the size of the circuit.
- We demonstrate how our method, TRUST, can be used to approximately learn a posterior over DAG structures given observational data. In particular, we utilize a two-step procedure, in which we (i) propose a structure for the SPN using a seed sampler; and (ii) optimize the parameters of the SPN in a variational inference scheme. Crucially, the tractable properties of the circuit enable the ELBO and its gradients to be computed exactly, without sampling.

2. Related Work

Bayesian approaches to structure learning infer a distribution over possible causal graphs. Such distributions can then be queried to extract useful information, such as estimates of causal effects, which can aid investigators in understanding the domain or in planning interventions (Castelletti & Consonni, 2021; Maathuis et al., 2010; Viinikka et al., 2020). Unfortunately, due to the super-exponential space, exact Bayesian inference methods for structure learning do not scale beyond d = 20 (Koivisto & Sood, 2004; Koivisto, 2006). As a result, there has been much interest in approximate methods, most notably MCMC sampling over the space of DAGs (Madigan et al., 1995; Giudici & Castelo, 2003).
Notable works in this direction include those of Friedman & Koller (2003), who operate over the much smaller space and smoother posterior landscape of topological orders, Tsamardinos et al. (2006), who reduce the state space by considering conditional independence, and Kuipers et al. (2018), who reduce the per-step computational cost associated with scoring. Alternatively, some recent works have applied variational inference to the Bayesian structure learning problem, where an approximate distribution over graphs is obtained by optimizing over some variational family of distributions over graphs. Unfortunately, existing representations are typically not very tractable: Annadani et al. (2021) and Cundy et al. (2021) utilize neural autoregressive and energy-based models respectively, while Lorch et al. (2021) employ sample-based approximations and particle variational inference (Liu & Wang, 2016). This presents significant challenges for gradient-based optimization, since the reparameterization trick is not applicable to the discrete space of graphs. Further, downstream inference queries can only be estimated approximately through sampling. In contrast, our proposed variational family based on tractable models makes optimization and inference exact and efficient.

Probabilistic circuits (Choi et al., 2020) are a general class of tractable probabilistic models which represent distributions using computational graphs. The key advantage of circuits, compared to other probabilistic models such as Bayesian networks, VAEs (Kingma & Welling, 2013), or GANs (Goodfellow et al., 2014), is their ability to perform tractable and exact inference, for instance, computing marginal probabilities. While the typical use case is to learn a distribution over a set of variables from data (Gens & Domingos, 2013; Rooshenas & Lowd, 2014), in this work we consider learning a circuit to approximate a given (intractable) posterior distribution over the space of DAGs, thus requiring different structure and parameter learning routines.

3. Background

3.1. Bayesian Structure Learning

Bayesian Networks. A Bayesian network (BN) N = (G, Θ) is a probabilistic model p(X) over d variables X = {X_1, ..., X_d}, specified using a directed acyclic graph (DAG) G, which encodes conditional independencies in the distribution p, and parameters Θ, which parameterize the mechanisms (conditional probability distributions) constituting the Bayesian network. The conditional probabilities take the form p(X_i | pa_G(X_i), Θ_i), giving rise to the joint data distribution

$$p(X \mid G, \Theta) = \prod_i p(X_i \mid \mathrm{pa}_G(X_i), \Theta_i),$$

where pa_G(X) denotes the parents of X in G.

One of the most popular BN models is the linear Gaussian model, under which the distribution is given by the structural equation X = XB + ε, where B ∈ R^{d×d} is a matrix of real weights parameterizing the mechanisms, and ε ~ N(b, Σ), with b ∈ R^d and Σ ∈ R^{d×d}_{≥0} a diagonal matrix of noise variances. In particular, for a given DAG G, we have B_ij = 0 for all i, j such that i is not a parent of j in G.
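To make the structural equation concrete, the following is a minimal sketch (ours, not from the paper) of ancestral sampling from a linear Gaussian BN given a weight matrix B whose sparsity pattern respects a DAG G; the helper `topological_order` and the example chain graph are illustrative assumptions.

```python
import numpy as np

def topological_order(B):
    """Topological order of the DAG implied by the nonzeros of B (parents -> children)."""
    d = B.shape[0]
    remaining, order = set(range(d)), []
    while remaining:
        # A node with no unprocessed parents can be sampled next.
        j = next(v for v in remaining if all(B[p, v] == 0 for p in remaining if p != v))
        order.append(j)
        remaining.remove(j)
    return order

def sample_linear_gaussian(B, b, noise_var, n_samples, rng=None):
    """Ancestral sampling from X = X B + eps, eps ~ N(b, diag(noise_var)).

    B[i, j] != 0 only if i is a parent of j in G, so sampling variables
    in a topological order of G is well-defined.
    """
    rng = np.random.default_rng(rng)
    d = B.shape[0]
    X = np.zeros((n_samples, d))
    for j in topological_order(B):
        eps = rng.normal(b[j], np.sqrt(noise_var[j]), size=n_samples)
        X[:, j] = X @ B[:, j] + eps  # parents of j are already filled in
    return X

# Example (illustrative): chain 1 -> 2 -> 3 with unit weights and standard normal noise.
B = np.array([[0., 1., 0.], [0., 0., 1.], [0., 0., 0.]])
data = sample_linear_gaussian(B, b=np.zeros(3), noise_var=np.ones(3), n_samples=1000)
```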
Whereas Bayesian networks typically only express probabilistic (conditional independence) information, causal Bayesian networks (Spirtes et al., 2000; Pearl, 2009) are additionally imbued with a causal interpretation, where, intuitively, the directed edges in G represent direct causation. More formally, causal BNs can predict the effect (the change in joint distribution) of interventions on the system, in which some mechanism is changed, for instance by setting a variable X to some value x independently of its parents.

Bayesian Structure Learning. Structure learning (Koller & Friedman, 2009; Glymour et al., 2019) is the problem of learning the DAG G of the (causal) Bayesian network responsible for generating some given data D. Typically, strong assumptions are required for structure learning; in this work, we make the common assumption of causal sufficiency, meaning that there are no latent (unobserved) confounders. Even given this assumption, it is often not possible to reliably infer the causal DAG, whether due to limited data or to non-identifiability within a Markov equivalence class. Instead of learning a single DAG, Bayesian approaches to structure learning express uncertainty over structures in a unified fashion by defining a prior p_pr(G) and (marginal) likelihood p_lh(D|G) over directed graphs G. A common assumption is that the prior and likelihood scores are modular, that is, they decompose into a product of terms for each mechanism G_i of the graph, where G_i specifies the set of parents of variable i in G. In such cases, the overall posterior decomposes as

$$p_G(G \mid D) \propto \mathbb{1}_{\mathrm{DAG}}(G)\, p_{pr}(G)\, p_{lh}(D \mid G) = \mathbb{1}_{\mathrm{DAG}}(G) \prod_i p_{pr}(G_i)\, p_{lh}(D_i \mid G_i).$$

The acyclicity constraint 1_DAG(G) induces correlations between the different mechanisms and presents the key computational challenge for posterior inference. The prior and likelihood can be chosen based on knowledge about the domain; for example, for linear Gaussian models, we can employ the BGe score (Kuipers et al., 2014), a closed-form expression for the marginal likelihood of a variable given its parent set (marginalizing over the weights of the linear model). The prior is typically chosen to penalize larger parent sets.

3.2. Sum-Product Networks

Sum-product networks (SPNs) are probabilistic circuits over a set of variables V, represented using a rooted DAG consisting of three types of nodes: leaf, sum, and product nodes. Each node can be viewed as representing a distribution over some subset of variables W ⊆ V, with the root node specifying an overall distribution q_φ(V). Each leaf node L specifies an input distribution over some subset of variables W ⊆ V, which is assumed to be tractable. Each product node P multiplies the distributions given by its children, i.e., P = ∏_{C_i ∈ ch(P)} C_i, while each sum node T is a weighted sum of its children, i.e., T = Σ_{C_i ∈ ch(T)} φ_i C_i. The weights φ_i of each sum node satisfy φ_i > 0 and Σ_i φ_i = 1, and are referred to as the parameters of the SPN. The scope of a node N is the set of variables N specifies a distribution over, and can be defined recursively as follows: each leaf node N has scope sc(N) = {V}, where V is the variable it specifies its distribution over, and each product or sum node N has scope sc(N) = ∪_{C ∈ ch(N)} sc(C).

SPNs provide a computationally convenient representation of probability distributions, enabling efficient and exact inference for many types of queries, given certain structural properties (Poon & Domingos, 2011; Peharz et al., 2014):

- An SPN is complete if, for every sum node T and any two children C_1, C_2 of T, it holds that sc(C_1) = sc(C_2). In other words, all the children of T, and thus T itself, have the same scope.
- An SPN is decomposable if, for every product node P and any two children C_1, C_2 of P, it holds that sc(C_1) ∩ sc(C_2) = ∅. In other words, the scope of P is partitioned by its children.
- An SPN is deterministic if, for every sum node T and any instantiation w of its scope sc(T) = W, at most one of its children C_i evaluates to a non-zero probability at w.
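To make these definitions concrete, here is a minimal sketch (ours, not the paper's implementation; the class names and numeric weights are illustrative) of a complete and decomposable SPN over two binary variables, evaluated bottom-up. Unobserved leaf variables return 1, previewing the marginal computation discussed next.

```python
# Minimal SPN sketch (illustrative): Leaf over one variable, Product over disjoint
# scopes, Sum with normalized weights over children that share a scope.

class Leaf:
    def __init__(self, var, probs):          # probs: {value: probability}
        self.var, self.probs = var, probs
        self.scope = {var}

    def value(self, assignment):
        # Marginalize the leaf's variable by returning 1 if it is unobserved.
        if self.var not in assignment:
            return 1.0
        return self.probs[assignment[self.var]]

class Product:
    def __init__(self, children):
        self.children = children
        self.scope = set().union(*(c.scope for c in children))  # decomposability: disjoint child scopes

    def value(self, assignment):
        out = 1.0
        for c in self.children:
            out *= c.value(assignment)
        return out

class Sum:
    def __init__(self, weighted_children):    # list of (weight, child); weights sum to 1
        self.weighted_children = weighted_children
        self.scope = weighted_children[0][1].scope  # completeness: all children share one scope

    def value(self, assignment):
        return sum(w * c.value(assignment) for w, c in self.weighted_children)

# A complete and decomposable SPN over binary variables X1, X2.
spn = Sum([
    (0.6, Product([Leaf("X1", {0: 0.8, 1: 0.2}), Leaf("X2", {0: 0.3, 1: 0.7})])),
    (0.4, Product([Leaf("X1", {0: 0.1, 1: 0.9}), Leaf("X2", {0: 0.5, 1: 0.5})])),
])

print(spn.value({"X1": 1, "X2": 0}))  # joint probability q(X1=1, X2=0)
print(spn.value({"X1": 1}))           # marginal q(X1=1), computed in one bottom-up pass
```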
Given completeness and decomposability, marginal inference becomes tractable: we can compute q_φ(W) for any W ⊆ V in time linear in the number of edges of the SPN. Conditional probabilities can be computed as the ratio of two marginal probabilities. If the SPN additionally satisfies determinism, MPE inference, i.e., max_{v: v_W = w} q_φ(v), also becomes tractable (Peharz et al., 2017).

4. Tractable Representations for Bayesian Structure Learning

In this work, we consider Bayesian structure learning over the joint space of topological orders and DAGs, where each order σ is a permutation of {1, ..., d}. For a subset S ⊆ {1, ..., d}, we write σ_S for the restriction of σ to S and G_S for the corresponding parent sets (G_i)_{i ∈ S}.

Each sum node T of an Order SPN is associated with a pair of disjoint subsets (S1, S2) of {1, ..., d} with |S2| > 1, and has scope sc(T) = (σ_S2, G_S2). It has K_T children with weights φ_{T,i} for i = 1, ..., K_T, where the ith child is a product node P associated with (S1, S21,i, S22,i) for some partition (S21,i, S22,i) of S2. Each product node P is associated with three disjoint subsets (S1, S21, S22) of {1, ..., d}, and has scope sc(P) = (σ_{S21∪S22}, G_{S21∪S22}), where σ_{S21∪S22} takes the form (σ_S21, σ_S22). It has two children, where the first child is associated with (S1, S21) and the second with (S1 ∪ S21, S22); these children are either sum nodes or leaves.

Figure 1. Example of a regular Order SPN for d = 4. (a) A regular Order SPN with expansion factors K = (3, 2); each sum/leaf node is labelled with its associated (S1, S2), and only one expansion beyond the first level is shown for clarity. (b) Example orders and graphs for three sum/leaf nodes; the graphs only include the parent sets of the S2 variables.

We can interpret each sum (or leaf) node T associated with (S1, S2) as representing a distribution over DAGs over the variables S2, where these variables can additionally have parents from among S1. In other words, every sum node represents a (smaller) Bayesian structure learning problem over a set of variables S2 and a set of potential confounders S1.

In practice, we organize the SPN into alternating layers of sum and product nodes, starting with the root sum node. In the jth sum layer, we create a fixed number K_j of children for each sum node T in the layer. Further, for each child i of each sum node T, we choose (S21,i, S22,i) such that |S21,i| = ⌈|S2|/2⌉ and |S22,i| = ⌊|S2|/2⌋, and we require that the partitions are distinct for different children i of T. Under these conditions, the Order SPN has ⌈log_2(d)⌉ sum (and product) layers. This ensures compactness of the representation and enables efficient tensorized computation over layers. We call such Order SPNs regular, and the associated list K of numbers of children per layer the expansion factors. An example of a regular Order SPN is shown in Figure 1: at the top sum layer, we create a child for each of K_1 = 3 different partitions of {1, 2, 3, 4} into equally sized subsets, each with an associated weight, and sum and product layers alternate until we reach the leaf nodes.
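As a rough illustration of this layered construction, the sketch below (ours, not the paper's code) recursively expands a node associated with (S1, S2) into K distinct near-half partitions of S2, mirroring the alternation of sum and product layers in a regular Order SPN. The random choice of partitions is a stand-in for the paper's seed-sampler-based structure proposal, and the dictionary representation is purely illustrative.

```python
import itertools
import random

def expand(S1, S2, expansion_factors, depth=0):
    """Recursively build the (S1, S2) associations of a regular Order SPN.

    Returns a nested dict mirroring the sum/product alternation; leaves are
    nodes whose S2 contains a single variable. Partitions are chosen at
    random here for illustration only.
    """
    if len(S2) == 1:
        return {"type": "leaf", "S1": S1, "S2": S2}

    K = expansion_factors[depth]
    half = (len(S2) + 1) // 2  # |S21| = ceil(|S2| / 2)
    # All ways of choosing S21 of that size from S2; take K distinct ones.
    all_S21 = list(itertools.combinations(sorted(S2), half))
    chosen = random.sample(all_S21, min(K, len(all_S21)))

    children = []
    for S21 in chosen:
        S21, S22 = set(S21), set(S2) - set(S21)
        children.append({
            "type": "product",
            # First child: variables S21 with potential confounders S1;
            # second child: variables S22 with potential confounders S1 ∪ S21.
            "children": [
                expand(S1, S21, expansion_factors, depth + 1),
                expand(S1 | S21, S22, expansion_factors, depth + 1),
            ],
        })
    return {"type": "sum", "S1": S1, "S2": S2, "children": children}

# Structure of the d = 4 example with expansion factors K = (3, 2).
structure = expand(set(), {1, 2, 3, 4}, expansion_factors=(3, 2))
```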
The leaf nodes L represent distributions over a single column of the graph: if L is associated with (S1, i), then it expresses a distribution over the parent set G_i of variable i. The interpretation of S1 is that this distribution should only have support over sets G_i ⊆ S1. This restriction ensures that Order SPNs are consistent, in the sense that they represent distributions over valid (σ, G) pairs (in particular, all graphs are acyclic):

Proposition 2. Let q_φ be an Order SPN. Then, for all pairs (σ, G) in the support of q_φ, it holds that G |= σ.

By design, (regular) Order SPNs satisfy the standard SPN properties that make them an efficient representation for inference, as we show in the following proposition. In the following sections, we use these properties for query computation, as well as for learning the SPN parameters.

Proposition 3. Any Order SPN is complete and decomposable, and regular Order SPNs are additionally deterministic.

4.3. Leaf Distributions

Given a leaf node associated with (S1, i), corresponding to a distribution on G_i, the only restriction imposed by the definition is that the distribution has support only on parent sets G_i ⊆ S1. Given that we are approximating an order-modular distribution p(σ, G) ∝ ∏_i p_{G_i}(G_i) 1_{G_i ⊆ σ…
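As a sketch of one possible leaf parameterization (ours, not the paper's; `local_score` is a placeholder for a modular log-score such as BGe, and the cap on parent-set size is an added assumption for tractability), a leaf associated with (S1, i) can enumerate candidate parent sets G_i ⊆ S1 and normalize their exponentiated scores.

```python
import itertools
import math

def leaf_distribution(i, S1, local_score, max_parents=3):
    """Distribution over parent sets G_i ⊆ S1 for variable i.

    local_score(i, parent_set) should return a modular log-score, e.g.
    log p_pr(G_i) + log p_lh(D_i | G_i). Enumeration is capped at
    max_parents parents for tractability (an assumption of this sketch).
    """
    candidates = [
        frozenset(ps)
        for k in range(min(len(S1), max_parents) + 1)
        for ps in itertools.combinations(sorted(S1), k)
    ]
    log_weights = {ps: local_score(i, ps) for ps in candidates}
    # Normalize with a log-sum-exp shift for numerical stability.
    shift = max(log_weights.values())
    z = sum(math.exp(lw - shift) for lw in log_weights.values())
    return {ps: math.exp(lw - shift) / z for ps, lw in log_weights.items()}

# Toy local score favouring small parent sets (purely illustrative).
toy_score = lambda i, ps: -0.5 * len(ps)
dist = leaf_distribution(i=4, S1={1, 2}, local_score=toy_score)
# dist maps each candidate parent set of variable 4 (subsets of {1, 2}) to its probability.
```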