# Decomposing Parameter Estimation Problems

Khaled S. Refaat, Arthur Choi, Adnan Darwiche
Computer Science Department
University of California, Los Angeles
{krefaat,aychoi,darwiche}@cs.ucla.edu

## Abstract

We propose a technique for decomposing the parameter learning problem in Bayesian networks into independent learning problems. Our technique applies to incomplete datasets and exploits variables that are either hidden or observed in the given dataset. We show empirically that the proposed technique can lead to orders-of-magnitude savings in learning time. We explain, analytically and empirically, the reasons behind the reported savings, and compare the proposed technique to related ones that are sometimes used by inference algorithms.

## 1 Introduction

Learning Bayesian network parameters is the problem of estimating the parameters of a known structure given a dataset. This learning task is usually formulated as an optimization problem that seeks maximum likelihood parameters: ones that maximize the probability of the dataset. A key distinction is commonly drawn between complete and incomplete datasets. In a complete dataset, the value of each variable is known in every example. In this case, maximum likelihood parameters are unique and can be easily estimated using a single pass over the dataset. When the data is incomplete, however, the optimization problem is generally non-convex, has multiple local optima, and is commonly solved by iterative methods, such as EM [5, 7], gradient descent [13] and, more recently, EDML [2, 11, 12].

Incomplete datasets may still exhibit a certain structure. In particular, certain variables may always be observed in the dataset, while others may always be unobserved (hidden). We exploit this structure by decomposing the parameter learning problem into smaller learning problems that can be solved independently. In particular, we show that the stationary points of the likelihood function can be characterized by those of the smaller problems. This implies that algorithms such as EM and gradient descent can be applied to the smaller problems while preserving their guarantees. Empirically, we show that the proposed decomposition technique can lead to orders-of-magnitude savings. Moreover, we show that the savings are amplified as the dataset grows in size. Finally, we explain these significant savings analytically by examining the impact of our decomposition technique on the dynamics of the convergence test used, and on the properties of the datasets associated with the smaller learning problems.

The paper is organized as follows. In Section 2, we provide some background on learning Bayesian network parameters. In Section 3, we present the decomposition technique, and we then prove its soundness in Section 4. Section 5 is dedicated to empirical results and to analyzing the reported savings. We discuss related work in Section 6 and finally close with some concluding remarks in Section 7. The proofs are given in the appendix of the supplementary material.

## 2 Learning Bayesian Network Parameters

We use upper case letters ($X$) to denote variables and lower case letters ($x$) to denote their values. Variable sets are denoted by bold-face upper case letters ($\mathbf{X}$) and their instantiations by bold-face lower case letters ($\mathbf{x}$). Generally, we will use $X$ to denote a variable in a Bayesian network and $\mathbf{U}$ to denote its parents.
A Bayesian network is a directed acyclic graph with a conditional probability table (CPT) associated with each node $X$ and its parents $\mathbf{U}$. For every variable instantiation $x$ and parent instantiation $\mathbf{u}$, the CPT of $X$ includes a parameter $\theta_{x|\mathbf{u}}$ that represents the probability $\Pr(X = x \mid \mathbf{U} = \mathbf{u})$. We will use $\theta$ to denote the set of all network parameters. Parameter learning in Bayesian networks is the process of estimating these parameters $\theta$ from a given dataset.

A dataset is a multi-set of examples. Each example is an instantiation of some network variables. We will use $D$ to denote a dataset and $d_1, \ldots, d_N$ to denote its $N$ examples. The following is a dataset over four binary variables (? indicates a missing value of a variable in an example):

| example | E | B | A | C |
|---------|---|---|---|---|
| $d_1$ | $e$ | $b$ | $a$ | ? |
| $d_2$ | ? | $b$ | $a$ | ? |
| $d_3$ | $e$ | $b$ | $a$ | ? |

A variable $X$ is observed in a dataset iff the value of $X$ is known in each example of the dataset (i.e., ? cannot appear in the column corresponding to variable $X$). Variables $A$ and $B$ are observed in the above dataset. Moreover, a variable $X$ is hidden in a dataset iff its value is unknown in every example of the dataset (i.e., only ? appears in the column of variable $X$). Variable $C$ is hidden in the above dataset. When all variables are observed in a dataset, the dataset is said to be complete. Otherwise, the dataset is incomplete. The above dataset is incomplete.

Given a dataset $D$ with examples $d_1, \ldots, d_N$, the likelihood of parameter estimates $\theta$ is defined as

$$L(\theta \mid D) = \prod_{i=1}^{N} \Pr_\theta(d_i).$$

Here, $\Pr_\theta$ is the distribution induced by the network structure and parameters $\theta$. One typically seeks maximum likelihood parameters $\theta^\star = \operatorname{argmax}_\theta L(\theta \mid D)$. When the dataset is complete, maximum likelihood estimates are unique and easily obtainable using a single pass over the dataset (e.g., [3, 6]). For incomplete datasets, the problem is generally non-convex and has multiple local optima. Iterative algorithms are usually used in this case to try to obtain maximum likelihood estimates. This includes EM [5, 7], gradient descent [13], and the more recent EDML algorithm [2, 11, 12]. The fixed points of these algorithms correspond to the stationary points of the likelihood function. Hence, these algorithms are not guaranteed to converge to global optima. As such, they are typically applied to multiple seeds (initial parameter estimates), while retaining the best estimates obtained across all seeds.

## 3 Decomposing the Learning Problem

We now show how the problem of learning Bayesian network parameters can be decomposed into independent learning problems. The proposed technique exploits two aspects of a dataset: hidden and observed variables.

**Proposition 1** The likelihood function $L(\theta \mid D)$ does not depend on the parameters of variable $X$ if $X$ is hidden in dataset $D$ and is a leaf of the network structure.

If a hidden variable appears as a leaf in the network structure, it can be removed from the structure while setting its parameters arbitrarily (assuming no prior). This process can be repeated until there are no leaf variables that are also hidden. The soundness of this technique follows from [14, 15].

Figure 1: Identifying components of network $G$ given $\mathbf{O} = \{V, X, Z\}$.

Our second decomposition technique will exploit the observed variables of a dataset. In a nutshell, we will (a) decompose the Bayesian network into a number of sub-networks, (b) learn the parameters of each sub-network independently, and then (c) assemble parameter estimates for the original network from the estimates obtained in each sub-network.
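Before turning to this second technique, here is a minimal sketch of the first one (iteratively removing hidden leaf variables). It is not the authors' code; the structure is assumed to be given as a map from each variable to its parents, and all names are illustrative.

```python
def prune_hidden_leaves(parents, hidden):
    """Repeatedly remove variables that are hidden and are leaves of the structure.

    parents : dict mapping each variable to the list of its parents
    hidden  : set of variables that are hidden in the dataset
    Returns a new parents dict; the parameters of removed variables can be set arbitrarily.
    """
    parents = {X: list(pa) for X, pa in parents.items()}
    while True:
        has_child = {U for pa in parents.values() for U in pa}
        leaves = [X for X in parents if X not in has_child and X in hidden]
        if not leaves:
            return parents
        for X in leaves:
            del parents[X]

# Usage: in the chain V -> X -> Y -> Z, if Z were hidden it would be pruned.
parents = {"V": [], "X": ["V"], "Y": ["X"], "Z": ["Y"]}
print(prune_hidden_leaves(parents, hidden={"Z"}))
```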
**Definition 1 (Component)** Let $G$ be a network, let $\mathbf{O}$ be some observed variables in $G$, and let $G|\mathbf{O}$ be the network that results from deleting all edges of $G$ that are outgoing from $\mathbf{O}$. A component of $G|\mathbf{O}$ is a maximal set of nodes that are connected in $G|\mathbf{O}$.

Consider the network $G$ in Figure 1, with observed variables $\mathbf{O} = \{V, X, Z\}$. Then $G|\mathbf{O}$ has three components in this case: $\mathbf{S}_1 = \{V\}$, $\mathbf{S}_2 = \{X\}$, and $\mathbf{S}_3 = \{Y, Z\}$. The components of a network partition its parameters into groups, one group per component. In the above example, the network parameters are partitioned into the following groups:

$$\mathbf{S}_1: \{\theta_{v}, \theta_{\bar v}\} \qquad \mathbf{S}_2: \{\theta_{x|v}, \theta_{\bar x|v}, \theta_{x|\bar v}, \theta_{\bar x|\bar v}\} \qquad \mathbf{S}_3: \{\theta_{y|x}, \theta_{\bar y|x}, \theta_{y|\bar x}, \theta_{\bar y|\bar x}, \theta_{z|y}, \theta_{\bar z|y}, \theta_{z|\bar y}, \theta_{\bar z|\bar y}\}.$$

We will later show that the learning problem can be decomposed into independent learning problems, each induced by one component. To define these independent problems, we need some definitions.

**Definition 2 (Boundary Node)** Let $\mathbf{S}$ be a component of $G|\mathbf{O}$. If an edge $B \to S$ appears in $G$, where $B \notin \mathbf{S}$ and $S \in \mathbf{S}$, then $B$ is called a boundary for component $\mathbf{S}$.

Considering Figure 1, node $X$ is the only boundary for component $\mathbf{S}_3 = \{Y, Z\}$. Moreover, node $V$ is the only boundary for component $\mathbf{S}_2 = \{X\}$. Component $\mathbf{S}_1 = \{V\}$ has no boundary nodes. The independent learning problems are based on the following sub-networks.

**Definition 3 (Sub-Network)** Let $\mathbf{S}$ be a component of $G|\mathbf{O}$ with boundary variables $\mathbf{B}$. The sub-network of component $\mathbf{S}$ is the subset of network $G$ induced by variables $\mathbf{S} \cup \mathbf{B}$.

Figure 2 depicts the three sub-networks that correspond to our running example.

Figure 2: The sub-networks induced by adding boundary variables to components.

The parameters of a sub-network will be learned using projected datasets.

**Definition 4** Let $D = d_1, \ldots, d_N$ be a dataset over variables $\mathbf{X}$ and let $\mathbf{Y}$ be a subset of variables $\mathbf{X}$. The projection of dataset $D$ on variables $\mathbf{Y}$ is the set of examples $e_1, \ldots, e_N$, where each $e_i$ is the subset of example $d_i$ that pertains to variables $\mathbf{Y}$.

We show below a dataset for the full Bayesian network in Figure 1, followed by three projected datasets, one for each of the sub-networks in Figure 2.

| | V | X | Y | Z |
|---|---|---|---|---|
| $d_1$ | $v$ | $x$ | ? | $z$ |
| $d_2$ | $\bar v$ | $x$ | ? | $z$ |
| $d_3$ | $\bar v$ | $\bar x$ | ? | $\bar z$ |

| | V | count |
|---|---|---|
| $e_1$ | $v$ | 1 |
| $e_2$ | $\bar v$ | 2 |

| | V | X | count |
|---|---|---|---|
| $e_1$ | $v$ | $x$ | 1 |
| $e_2$ | $\bar v$ | $x$ | 1 |
| $e_3$ | $\bar v$ | $\bar x$ | 1 |

| | X | Y | Z | count |
|---|---|---|---|---|
| $e_1$ | $x$ | ? | $z$ | 2 |
| $e_2$ | $\bar x$ | ? | $\bar z$ | 1 |

The projected datasets are compressed as we only represent unique examples, together with a count of how many times each example appears in the dataset. Using compressed datasets is crucial to realizing the full potential of decomposition, as it ensures that the size of a projected dataset is at most exponential in the number of variables appearing in its sub-network (more on this later).

We are now ready to describe our decomposition technique. Given a Bayesian network structure $G$ and a dataset $D$ that observes variables $\mathbf{O}$, we can get the stationary points of the likelihood function for network $G$ as follows:

1. Identify the components $\mathbf{S}_1, \ldots, \mathbf{S}_M$ of $G|\mathbf{O}$ (Definition 1).
2. Construct a sub-network for each component $\mathbf{S}_i$ and its boundary variables $\mathbf{B}_i$ (Definition 3).
3. Project the dataset $D$ on the variables of each sub-network (Definition 4).
4. Identify a stationary point for each sub-network and its projected dataset (using, e.g., EM, EDML or gradient descent).
5. Recover the learned parameters of non-boundary variables from each sub-network.

We will next prove that (a) these parameters are a stationary point of the likelihood function for network $G$, and (b) every stationary point of the likelihood function can be generated this way (using an appropriate seed).
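As a concrete companion to steps 1–3 above, the following is a minimal sketch (not the authors' code) of component identification, boundary computation, and dataset projection. It assumes the structure is given as a map from each variable to its parent list and the dataset as a list of dictionaries with `None` marking missing values; all names and data formats are illustrative.

```python
from collections import Counter

def components(parents, observed):
    """Connected components of G|O, obtained by dropping edges outgoing from observed nodes.

    Assumes every variable appears as a key in `parents`.
    """
    adj = {X: set() for X in parents}
    for X, pa in parents.items():
        for U in pa:
            if U not in observed:        # edge U -> X survives in G|O only if U is not observed
                adj[X].add(U)
                adj[U].add(X)
    comps, seen = [], set()
    for X in parents:
        if X in seen:
            continue
        comp, stack = set(), [X]
        while stack:                     # depth-first traversal of one component
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(adj[node] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def boundary(parents, comp):
    """Boundary nodes of a component: parents of its members that lie outside the component."""
    return {U for X in comp for U in parents[X] if U not in comp}

def project(data, variables):
    """Project a dataset on `variables`, keeping unique examples with counts (a compressed dataset)."""
    return Counter(tuple(sorted((V, d.get(V)) for V in variables)) for d in data)

# Running example (Figure 1): the chain V -> X -> Y -> Z with O = {V, X, Z}.
parents = {"V": [], "X": ["V"], "Y": ["X"], "Z": ["Y"]}
observed = {"V", "X", "Z"}
for comp in components(parents, observed):
    print(sorted(comp), "boundary:", boundary(parents, comp))
data = [{"V": "v", "X": "x", "Y": None, "Z": "z"},
        {"V": "~v", "X": "x", "Y": None, "Z": "z"},
        {"V": "~v", "X": "~x", "Y": None, "Z": "~z"}]
print(project(data, {"X", "Y", "Z"}))    # two unique projected examples, with counts 2 and 1
```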
## 4 Soundness

The soundness of our decomposition technique is based on three steps. We first introduce the notion of a parameter term, on which our proof rests. We then show how the likelihood function for the Bayesian network can be decomposed into component likelihood functions, one for each sub-network. We finally show that the stationary points of the likelihood function (network) can be characterized by the stationary points of the component likelihood functions (sub-networks).

Two parameters are compatible iff they agree on the state of their common variables. For example, parameters $\theta_{z|y}$ and $\theta_{y|x}$ are compatible, but parameters $\theta_{z|y}$ and $\theta_{\bar y|x}$ are not compatible, as $y \neq \bar y$. Moreover, a parameter is compatible with an example iff they agree on the state of their common variables. Parameter $\theta_{y|x}$ is compatible with example $x, y, z$, but not with example $x, \bar y, z$.

**Definition 5 (Parameter Term)** Let $\mathbf{S}$ be a set of network variables and let $d$ be an example. A parameter term for $\mathbf{S}$ and $d$, denoted $\Theta^{d}_{\mathbf{S}}$, is a product of compatible network parameters, one for each variable in $\mathbf{S}$, that are also compatible with example $d$.

Consider the network $X \to Y \to Z$. If $\mathbf{S} = \{Y, Z\}$ and $d = x, z$, then $\Theta^{d}_{\mathbf{S}}$ will denote either $\theta_{y|x}\theta_{z|y}$ or $\theta_{\bar y|x}\theta_{z|\bar y}$. Moreover, if $\mathbf{S} = \{X, Y, Z\}$, then $\Theta^{d}_{\mathbf{S}}$ will denote either $\theta_{x}\theta_{y|x}\theta_{z|y}$ or $\theta_{x}\theta_{\bar y|x}\theta_{z|\bar y}$. In this case, $\Pr(d) = \sum_{\Theta^{d}_{\mathbf{S}}} \Theta^{d}_{\mathbf{S}}$. This holds more generally whenever $\mathbf{S}$ is the set of all network variables. We will now use parameter terms to show how the likelihood function can be decomposed into component likelihood functions.

**Theorem 1** Let $\mathbf{S}$ be a component of $G|\mathbf{O}$ and let $\mathbf{R}$ be the remaining variables of network $G$. If variables $\mathbf{O}$ are observed in example $d$, we have

$$\Pr(d) = \Big(\sum_{\Theta^{d}_{\mathbf{S}}} \Theta^{d}_{\mathbf{S}}\Big)\Big(\sum_{\Theta^{d}_{\mathbf{R}}} \Theta^{d}_{\mathbf{R}}\Big).$$

If $\theta$ denotes all network parameters, and $\mathbf{S}$ is a set of network variables, then $\theta_{:\mathbf{S}}$ will denote the subset of network parameters that pertain to the variables in $\mathbf{S}$. Each component $\mathbf{S}$ of a Bayesian network induces its own likelihood function over parameters $\theta_{:\mathbf{S}}$.

**Definition 6 (Component Likelihood)** Let $\mathbf{S}$ be a component of $G|\mathbf{O}$. For dataset $D = d_1, \ldots, d_N$, the component likelihood for $\mathbf{S}$ is defined as

$$L(\theta_{:\mathbf{S}} \mid D) = \prod_{i=1}^{N} \sum_{\Theta^{d_i}_{\mathbf{S}}} \Theta^{d_i}_{\mathbf{S}}.$$

In our running example, the components are $\mathbf{S}_1 = \{V\}$, $\mathbf{S}_2 = \{X\}$ and $\mathbf{S}_3 = \{Y, Z\}$. Moreover, the observed variables are $\mathbf{O} = \{V, X, Z\}$. Hence, the component likelihoods are

$$L(\theta_{:\mathbf{S}_1} \mid D) = \big[\theta_{v}\big]\big[\theta_{\bar v}\big]\big[\theta_{\bar v}\big]$$
$$L(\theta_{:\mathbf{S}_2} \mid D) = \big[\theta_{x|v}\big]\big[\theta_{x|\bar v}\big]\big[\theta_{\bar x|\bar v}\big]$$
$$L(\theta_{:\mathbf{S}_3} \mid D) = \big[\theta_{y|x}\theta_{z|y} + \theta_{\bar y|x}\theta_{z|\bar y}\big]\big[\theta_{y|x}\theta_{z|y} + \theta_{\bar y|x}\theta_{z|\bar y}\big]\big[\theta_{y|\bar x}\theta_{\bar z|y} + \theta_{\bar y|\bar x}\theta_{\bar z|\bar y}\big].$$

The parameters of component likelihoods partition the network parameters. That is, the parameters of two component likelihoods are always non-overlapping. Moreover, the parameters of component likelihoods account for all network parameters.¹ We can now state our main decomposition result, which is a direct corollary of Theorem 1.

¹ The sum-to-one constraints that underlie each component likelihood also partition the sum-to-one constraints of the likelihood function.

**Corollary 1** Let $\mathbf{S}_1, \ldots, \mathbf{S}_M$ be the components of $G|\mathbf{O}$. If variables $\mathbf{O}$ are observed in dataset $D$, then

$$L(\theta \mid D) = \prod_{i=1}^{M} L(\theta_{:\mathbf{S}_i} \mid D).$$

Hence, the network likelihood decomposes into a product of component likelihoods. This leads to another important corollary (see Lemma 1 in the Appendix):

**Corollary 2** Let $\mathbf{S}_1, \ldots, \mathbf{S}_M$ be the components of $G|\mathbf{O}$. If variables $\mathbf{O}$ are observed in dataset $D$, then $\theta$ is a stationary point of the likelihood $L(\theta \mid D)$ iff, for each $i$, $\theta_{:\mathbf{S}_i}$ is a stationary point of the component likelihood $L(\theta_{:\mathbf{S}_i} \mid D)$.

The search for stationary points of the network likelihood is now decomposed into independent searches for stationary points of component likelihoods.
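As a small numerical illustration of Definition 6, the sketch below evaluates the component likelihood $L(\theta_{:\mathbf{S}_3} \mid D)$ of the running example by summing parameter terms over the hidden variable $Y$ in each projected example. The numerical parameter values are arbitrary placeholders, not estimates from the paper.

```python
# Component likelihood for S3 = {Y, Z} in the running example (Definition 6).
# The boundary variable X and the variable Z are observed; Y is hidden and summed out.
# The theta values below are arbitrary placeholders chosen only to make the sketch runnable.
theta_y = {("y", "x"): 0.7, ("~y", "x"): 0.3,    # theta_{y|x},  theta_{~y|x}
           ("y", "~x"): 0.4, ("~y", "~x"): 0.6}  # theta_{y|~x}, theta_{~y|~x}
theta_z = {("z", "y"): 0.9, ("~z", "y"): 0.1,    # theta_{z|y},  theta_{~z|y}
           ("z", "~y"): 0.2, ("~z", "~y"): 0.8}  # theta_{z|~y}, theta_{~z|~y}

def term_sum(x, z):
    """Sum of parameter terms for S3 = {Y, Z} in one example with X=x and Z=z observed."""
    return sum(theta_y[(y, x)] * theta_z[(z, y)] for y in ("y", "~y"))

# Projected dataset on {X, Y, Z}: example (x, ?, z) with count 2 and (~x, ?, ~z) with count 1.
projected = [(("x", "z"), 2), (("~x", "~z"), 1)]
L_S3 = 1.0
for (x, z), count in projected:
    L_S3 *= term_sum(x, z) ** count
print(L_S3)   # product of three bracketed sums, matching the expression for L(theta_{:S3} | D)
```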
We will now show that the stationary points of a component likelihood can be identified using any algorithm that identifies such points for the network likelihood.

**Theorem 2** Consider a sub-network $G'$ induced by component $\mathbf{S}$ and boundary variables $\mathbf{B}$. Let $\theta'$ be the parameters of sub-network $G'$, and let $D$ be a dataset for $G'$ that observes the boundary variables $\mathbf{B}$. Then $\theta'$ is a stationary point for the sub-network likelihood $L(\theta' \mid D)$ only if $\theta'_{:\mathbf{S}}$ is a stationary point for the component likelihood $L(\theta_{:\mathbf{S}} \mid D)$. Moreover, every stationary point for $L(\theta_{:\mathbf{S}} \mid D)$ is part of some stationary point for $L(\theta' \mid D)$.

Given an algorithm that identifies stationary points of the likelihood function of Bayesian networks (e.g., EM), we can now identify all stationary points of a component likelihood. That is, we just apply this algorithm to the sub-network of each component $\mathbf{S}$, and then extract the parameter estimates of the variables in $\mathbf{S}$ while ignoring the parameters of boundary variables. This proves the soundness of our proposed decomposition technique.

## 5 The Computational Benefit of Decomposition

We will now illustrate the computational benefits of the proposed decomposition technique, showing orders-of-magnitude reductions in learning time. Our experiments are structured as follows. Given a Bayesian network $G$, we generate a dataset $D$ while ensuring that a certain percentage of variables are observed, with all others hidden. Using dataset $D$, we estimate the parameters of network $G$ using two methods. The first uses classical EM on network $G$ and dataset $D$. The second decomposes network $G$ into its sub-networks $G_1, \ldots, G_M$, projects the dataset $D$ on each sub-network, and then applies EM to each sub-network and its projected dataset. This method is called D-EM (for Decomposed EM). We use the same seed for both EM and D-EM.

Before we present our results, we have the following observations on our data generation model. First, we made all unobserved variables hidden (as opposed to missing at random), as this leads to a more difficult learning problem, especially for EM (even with the pruning of hidden leaf nodes).

Figure 3: Speed-up of D-EM over EM on chain networks: three chains (180, 380, and 500 variables) (left), and tree networks (63, 127, 255, and 511 variables) (right), with three random datasets per network/observed percentage, and $2^{10}$ examples per dataset.

| Observed % | Network | D-EM Speed-up | Network | D-EM Speed-up | Network | D-EM Speed-up |
|---|---|---|---|---|---|---|
| 95.0% | alarm | 267.67x | diagnose | 43.03x | andes | 155.54x |
| 90.0% | alarm | 173.47x | diagnose | 17.16x | andes | 52.63x |
| 80.0% | alarm | 115.4x | diagnose | 11.86x | andes | 14.27x |
| 70.0% | alarm | 87.67x | diagnose | 3.25x | andes | 2.96x |
| 60.0% | alarm | 92.65x | diagnose | 3.48x | andes | 0.77x |
| 50.0% | alarm | 12.09x | diagnose | 3.73x | andes | 1.01x |
| 95.0% | win95pts | 591.38x | water | 811.48x | pigs | 235.63x |
| 90.0% | win95pts | 112.57x | water | 110.27x | pigs | 37.61x |
| 80.0% | win95pts | 22.41x | water | 7.23x | pigs | 34.19x |
| 70.0% | win95pts | 17.92x | water | 1.5x | pigs | 16.23x |
| 60.0% | win95pts | 4.8x | water | 2.03x | pigs | 4.1x |
| 50.0% | win95pts | 7.99x | water | 4.4x | pigs | 3.16x |

Table 1: Speed-up of D-EM over EM on UAI networks. Three random datasets per network/observed percentage, with $2^{10}$ examples per dataset.
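The data-generation model just described (a fixed percentage of variables observed, all remaining variables hidden in every example) can be sketched as follows. The forward-sampling routine, the CPT values, and all names are illustrative assumptions, not the authors' actual experimental setup.

```python
import random

def forward_sample(parents, cpts, rng):
    """Sample one complete example from a binary Bayesian network by forward sampling.

    parents : dict mapping each variable to a tuple of its parents (the graph must be acyclic)
    cpts    : dict mapping (X, parent_values_tuple) to P(X = 1 | parents)
    """
    example, pending = {}, list(parents)
    while pending:                        # crude topological pass: sample nodes whose parents are done
        for X in list(pending):
            if all(U in example for U in parents[X]):
                u = tuple(example[U] for U in parents[X])
                example[X] = int(rng.random() < cpts[(X, u)])
                pending.remove(X)
    return example

def generate_dataset(parents, cpts, n_examples, observed_fraction, rng):
    """Generate a dataset in which a fixed fraction of variables is observed and the rest are hidden."""
    variables = list(parents)
    rng.shuffle(variables)
    n_obs = int(observed_fraction * len(variables))
    observed = set(variables[:n_obs])     # the remaining variables are hidden in every example
    data = []
    for _ in range(n_examples):
        full = forward_sample(parents, cpts, rng)
        data.append({X: (full[X] if X in observed else None) for X in parents})
    return observed, data

# Usage on the chain V -> X -> Y -> Z with arbitrary placeholder CPTs.
rng = random.Random(0)
parents = {"V": (), "X": ("V",), "Y": ("X",), "Z": ("Y",)}
cpts = {("V", ()): 0.6,
        ("X", (0,)): 0.3, ("X", (1,)): 0.8,
        ("Y", (0,)): 0.5, ("Y", (1,)): 0.2,
        ("Z", (0,)): 0.7, ("Z", (1,)): 0.4}
observed, data = generate_dataset(parents, cpts, n_examples=5, observed_fraction=0.75, rng=rng)
print(sorted(observed), data[0])
```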
Second, it is not uncommon to have a significant number of variables that are always observed in real-world datasets. For example, in the UCI repository: the internet advertisements dataset has 1558 variables, only 3 of which have missing values; the automobile dataset has 26 variables, of which 7 have missing values; the dermatology dataset has 34 variables, where only age can be missing; and the mushroom dataset has 22 variables, where only one variable has missing values [1].

We performed our experiments on three sets of networks: synthesized chains, synthesized complete binary trees, and benchmarks from the UAI 2008 evaluation together with other standard benchmarks (called UAI networks): alarm, win95pts, andes, diagnose, water, and pigs. Figure 3 and Table 1 depict the obtained time savings. As can be seen from these results, decomposing chains and trees leads to two orders-of-magnitude speed-ups for almost all observed percentages. For UAI networks, when observing 70% of the variables or more, one obtains one-to-two orders-of-magnitude speed-ups. We note here that the time reported for D-EM includes the time needed for decomposition (i.e., identifying the sub-networks and their projected datasets). Similar results for EDML are shown in the supplementary material.

The reported computational savings appear quite surprising. We now shed some light on the culprit behind these savings. We also argue that some of the most prominent tools for Bayesian networks do not appear to employ the proposed decomposition technique when learning network parameters.

Our first analytic explanation for the obtained savings is based on understanding the role of data projection, which can be illustrated by the following example. Consider a chain network over binary variables $X_1, \ldots, X_n$, where $n$ is even. Consider also a dataset $D$ in which variable $X_i$ is observed for all odd $i$. There are $n/2$ sub-networks in this case. The first sub-network is $X_1$. The remaining sub-networks have the form $X_{i-1} \to X_i \to X_{i+1}$ for $i = 2, 4, \ldots, n-2$ (node $X_n$ will be pruned). The dataset $D$ can have up to $2^{n/2}$ distinct examples. If one learns parameters without decomposition, one would need to call the inference engine once for each distinct example, in each iteration of the learning algorithm. With $m$ iterations, the inference engine may be called up to $m\,2^{n/2}$ times. When learning with decomposition, however, each projected dataset will have at most 2 distinct examples for sub-network $X_1$, and at most 4 distinct examples for sub-network $X_{i-1} \to X_i \to X_{i+1}$ (variable $X_i$ is hidden, while variables $X_{i-1}$ and $X_{i+1}$ are observed). Hence, if sub-network $i$ takes $m_i$ iterations to converge, then the inference engine would need to be called at most $2m_1 + 4(m_2 + m_4 + \cdots + m_{n-2})$ times. We will later show that $m_i$ is generally significantly smaller than $m$. Hence, with decomposed learning, the number of calls to the inference engine can be significantly smaller, which can contribute significantly to the obtained savings.²

² The analysis in this section was restricted to chains to make the discussion concrete. This analysis, however, can be generalized to arbitrary networks if enough variables are observed in the corresponding dataset.
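To make the counting argument above concrete, here is a small worked example with hypothetical values for the chain length $n$, the iteration count $m$ of EM without decomposition, and a uniform per-sub-network iteration count; all numbers are illustrative only.

```python
# Upper bounds on inference-engine calls for the chain example (hypothetical numbers).
n = 20        # chain X_1 -> ... -> X_n, with X_i observed for all odd i
m = 100       # iterations of EM without decomposition
m_i = 5       # assume every sub-network converges in 5 iterations

calls_without = m * 2 ** (n // 2)    # up to m * 2^(n/2): one call per distinct example per iteration
num_middle = (n - 2) // 2            # sub-networks X_{i-1} -> X_i -> X_{i+1} for i = 2, 4, ..., n-2
calls_with = 2 * m_i + 4 * num_middle * m_i
print(calls_without, calls_with)     # 102400 vs. 190
```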
Figure 4: Left: Speed-up of D-EM over EM as a function of dataset size, for a chain network with 180 variables, while observing 50% of the variables. Right pair: the number of iterations required by each sub-network, sorted in decreasing order, when learning the pigs network while observing 90% of the variables, with convergence based on parameters (left) and on likelihood (right).

Figure 5: Effect of dataset size (log-scale) on learning time in seconds, for D-EM, SMILE, and SAMIAM.

Our analysis suggests that the savings obtained from decomposing the learning problem would amplify as the dataset gets larger. This can be seen clearly in Figure 4 (left), which shows that the speed-up of D-EM over EM grows linearly with the dataset size. Hence, decomposition can be critical when learning with very large datasets. Interestingly, two of the most prominent (non-commercial) tools for Bayesian networks do not exhibit this behavior on the chain network discussed above. This is shown in Figure 5, which compares D-EM to the EM implementations of the GENIE/SMILE and SAMIAM systems,³ both of which were represented in previous inference evaluations [4]. In particular, we ran these systems on a chain network $X_0 \to X_1 \to \cdots \to X_{100}$, where each variable has 10 states, using datasets with alternating observed and hidden variables. Each plot point represents an average over 20 simulated datasets, where we recorded the time to execute each EM algorithm (excluding the time to read networks and datasets from file, which was negligible compared to learning time). Clearly, D-EM scales better in terms of time than both SMILE and SAMIAM as the size of the dataset increases. As explained in the above analysis, the number of calls to the inference engine by D-EM is not necessarily linear in the dataset size. Note here that D-EM used a stricter convergence threshold, and obtained better likelihoods, than both SMILE and SAMIAM in all cases. Yet, D-EM was able to achieve one-to-two orders-of-magnitude speed-ups as the dataset grows in size. On the other hand, SAMIAM was more efficient than SMILE, but obtained worse likelihoods in all cases, using their default settings (the same seed was used for all algorithms).

³ Available at http://genie.sis.pitt.edu/ and http://reasoning.cs.ucla.edu/samiam/. SMILE's C++ API was used to run EM, using default options, except that we suppressed the randomized-parameters option. SAMIAM's Java API was used to run EM (via the Code Bandit feature), also using default options, with the Hugin algorithm as the underlying inference engine.

Our second analytic explanation for the obtained savings is based on understanding the dynamics of the convergence test used by iterative algorithms such as EM. Such algorithms employ a convergence test based on either parameter or likelihood change. According to the first test, one compares the parameter estimates obtained at iteration $i$ of the algorithm to those obtained at iteration $i-1$. If the estimates are close enough, the algorithm converges. The likelihood test is similar, except that the likelihood of the estimates is compared across iterations. In our experiments, we used a convergence test based on parameter change. In particular, when the absolute change in every parameter falls below the set threshold of $10^{-4}$, convergence is declared by EM. When learning with decomposition, each sub-network is allowed to converge independently, which can contribute significantly to the obtained savings. In particular, with enough observed variables, we found that the vast majority of sub-networks converge very quickly, sometimes in one iteration (when the projected dataset is complete). In fact, due to this phenomenon, the convergence threshold for sub-networks can be further tightened without adversely affecting the total running time.
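A minimal sketch of the two convergence tests just described: the parameter-based test declares convergence when every parameter changes by less than a threshold between consecutive iterations, while the likelihood-based test compares log-likelihoods instead. The function names and the `em_step` routine mentioned in the comment are hypothetical.

```python
def converged_by_parameters(theta_prev, theta_curr, threshold=1e-4):
    """Parameter-based test: the absolute change of every parameter falls below `threshold`."""
    return all(abs(theta_curr[k] - theta_prev[k]) < threshold for k in theta_curr)

def converged_by_likelihood(loglik_prev, loglik_curr, threshold=1e-5):
    """Likelihood-based test: the change in log-likelihood across iterations falls below `threshold`."""
    return abs(loglik_curr - loglik_prev) < threshold

# In D-EM, each sub-network runs its own loop and converges independently, e.g.:
#     while not converged_by_parameters(theta_prev, theta_curr, threshold=1e-5):
#         theta_prev, theta_curr = theta_curr, em_step(sub_network, projected_data, theta_curr)
```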
In our experiments, we used a threshold of $10^{-5}$ for D-EM, which is tighter than the threshold used for EM. Figure 4 (right pair) illustrates decomposed convergence, by showing the number of iterations required by each sub-network to converge, sorted in decreasing order, with the convergence test based on parameters (left) and on likelihood (right). The vast majority of sub-networks converged very quickly. Here, convergence was declared when the change in parameters or log-likelihood, respectively, fell below the set threshold of $10^{-5}$.

## 6 Related Work

The decomposition techniques we discussed in this paper have long been utilized in the context of inference, but apparently not in learning. In particular, leaf nodes that do not appear in evidence $\mathbf{e}$ have been called barren nodes in [14], which showed the soundness of their removal during inference with evidence $\mathbf{e}$. Similarly, deleting edges outgoing from evidence nodes has been called evidence absorption, and its soundness was shown in [15]. Interestingly enough, both of these techniques are employed by the inference engines of SAMIAM and SMILE,⁴ even though neither seems to employ them when learning network parameters as we propose here (see the earlier experiments). When employed during inference, these techniques simplify the network to reduce the time needed to compute queries (e.g., the conditional marginals needed by learning algorithms). However, when employed in the context of learning, these techniques reduce the number of calls that need to be made to an inference engine. The difference is therefore fundamental, and the effects of the techniques are orthogonal. In fact, the inference engine we used in our experiments does employ decomposition techniques. Yet, we were still able to obtain orders-of-magnitude speed-ups when decomposing the learning problem.

⁴ SMILE actually employs a more advanced technique known as relevance reasoning [8].

On the other hand, our proposed decomposition techniques do not apply fully to Markov random fields (MRFs), as the partition function cannot be decomposed even when the data is complete (evaluating the partition function is independent of the data). However, distributed learning algorithms have been proposed in the literature. For example, the recently proposed LAP algorithm is a consistent estimator for MRFs under complete data [10]. A method similar to LAP was independently introduced by [9] in the context of Gaussian graphical models.

## 7 Conclusion

We proposed a technique for decomposing the problem of learning Bayesian network parameters into independent learning problems. The technique applies to incomplete datasets and is based on exploiting variables that are either hidden or observed. Our empirical results suggest that orders-of-magnitude speed-ups can be obtained from this decomposition technique, when enough or particular variables are hidden or observed in the dataset. The proposed decomposition technique is orthogonal to the ones used for optimizing inference, as one reduces the time of inference queries while the other reduces the number of such queries. The latter effect is due to decomposing the dataset and the convergence test. The decomposition process incurs little overhead, as it can be performed in time that is linear in the structure size and dataset size. Hence, given the potential savings it may lead to, it appears that one should always try to decompose before learning network parameters.

## Acknowledgments

This work has been partially supported by ONR grant #N00014-12-1-0423 and NSF grant #IIS-1118122.
## References

[1] K. Bache and M. Lichman. UCI machine learning repository. Technical report, University of California, Irvine, School of Information and Computer Science, 2013.

[2] Arthur Choi, Khaled S. Refaat, and Adnan Darwiche. EDML: A method for learning parameters in Bayesian networks. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2011.

[3] Adnan Darwiche. Modeling and Reasoning with Bayesian Networks. Cambridge University Press, 2009.

[4] Adnan Darwiche, Rina Dechter, Arthur Choi, Vibhav Gogate, and Lars Otten. Results from the probabilistic inference evaluation of UAI-08. http://graphmod.ics.uci.edu/uai08/Evaluation/Report, 2008.

[5] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1-38, 1977.

[6] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

[7] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19:191-201, 1995.

[8] Yan Lin and Marek Druzdzel. Computational advantages of relevance reasoning in Bayesian belief networks. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence (UAI), 1997.

[9] Z. Meng, D. Wei, A. Wiesel, and A. O. Hero III. Distributed learning of Gaussian graphical models via marginal likelihoods. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2013.

[10] Yariv Dror Mizrahi, Misha Denil, and Nando de Freitas. Linear and parallel learning of Markov random fields. In Proceedings of the International Conference on Machine Learning (ICML), 2014.

[11] Khaled S. Refaat, Arthur Choi, and Adnan Darwiche. New advances and theoretical insights into EDML. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), pages 705-714, 2012.

[12] Khaled S. Refaat, Arthur Choi, and Adnan Darwiche. EDML for learning parameters in directed and undirected graphical models. In Advances in Neural Information Processing Systems (NIPS), 2013.

[13] S. Russell, J. Binder, D. Koller, and K. Kanazawa. Local learning in probabilistic networks with hidden variables. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI), 1995.

[14] R. Shachter. Evaluating influence diagrams. Operations Research, 1986.

[15] R. Shachter. Evidence absorption and propagation through evidence reversals. In Proceedings of the Fifth Conference on Uncertainty in Artificial Intelligence (UAI), 1989.