Computing Divergences between Discrete Decomposable Models

Loong Kuan Lee¹, Nico Piatkowski², François Petitjean¹, and Geoffrey I. Webb¹
¹ Department of Data Science and AI, Monash University, Melbourne, Australia
² Fraunhofer IAIS, Schloss Birlinghoven, 53757 Sankt Augustin, Germany
mail@lklee.dev

Abstract

Many applications, including machine learning, benefit from computing the exact divergence between two discrete probability measures. Unfortunately, in the absence of any assumptions about the structure or independencies within these distributions, computing the divergence between them is intractable in high dimensions. We show that we can compute a wide family of functionals and divergences, such as the alpha-beta divergence, between two decomposable models, i.e. chordal Markov networks, in time exponential in the treewidth of these models. The alpha-beta divergence is a family of divergences that includes popular divergences such as the Kullback-Leibler divergence, the Hellinger distance, and the chi-squared divergence. Thus, we can accurately compute the exact value of any divergence in this broad class to the extent that we can accurately model the two distributions using decomposable models.

Introduction

Computing the divergence, i.e. the degree of difference, between two joint probability distributions is a problem with many applications in machine learning. For instance, it can be used to estimate the divergence between the underlying distributions of two data samples. This particular application is useful in the study of changing distributions, i.e. concept drift (Schlimmer and Granger 1986; Webb et al. 2018), in the detection of anomalous regions in spatiotemporal data (Barz et al. 2019; Piatkowski, Lee, and Morik 2013), and in tasks related to the retrieval, classification, and visualisation of time series data (Chen, Ye, and Li 2020).
Although there has been much work on estimating the divergence between two general high-dimensional discrete distributions (Bhattacharya, Kar, and Pal 2009; Abdullah et al. 2016), these approaches do not compute the exact divergence, as doing so is intractable without any knowledge of, or assumptions about, the structure within the distributions. Instead, approaches that do take advantage of structural properties of the distributions for efficient computation of divergences have appeared in the literature before, e.g., for computing the Kullback-Leibler (KL) divergence between Bayesian networks (BNs) (Moral, Cano, and Gómez-Olmedo 2021). It is also possible to tractably compute the KL divergence between a general Markov network (MN) and a MN in which inference tasks are tractable (Koller and Friedman 2009). However, there are situations where one might want to compute divergences other than the KL divergence (Nowozin, Cseke, and Tomioka 2016), in particular in the variational inference community, where other divergences have been employed to derive alternative evidence lower bounds (Chen et al. 2018; Li and Turner 2016; Dieng et al. 2017), and in the context of generative models (Genevay, Peyré, and Cuturi 2018). Furthermore, in natural language processing, using the KL divergence is problematic in the presence of uneven word frequencies (Labeau and Cohen 2019). Even for fundamental problems like model selection, we show that considering different types of divergences can be beneficial. Motivated by these considerations, in this paper we show how to compute a wide family of divergences, the αβ-divergences, between two decomposable models (DMs). In the process of showing how the αβ-divergence can be computed between any two DMs, we will reach a more general result.
That is, we will show how one can compute, between two DMs, the functional F defined in Definition 1:

Definition 1 (Functional F).

F(P, Q; g, h, g′, h′) = Σ_{x∈X} g[P](x) · h[Q](x) · L(g′[P](x) · h′[Q](x))

where L is any function with the property L(∏_r r) = Σ_r L(r), and f ∈ {g, h, g′, h′} are functionals with the property that, for any distribution P of a DM with graph structure G:

f[P](x) = ∏_{C∈C(G)} f[P_C](x_C)    (1)

An example of such a functional is the power functional: [∏_{C∈C(G)} P_C(x_C)]² = ∏_{C∈C(G)} P_C(x_C)². This result implies the possibility of computing divergences and functionals other than the αβ-divergence between two DMs. In fact, we show that F can be computed by running the junction tree algorithm (JTA) over a specifically constructed chordal graph and set of initial factors. Proofs of our contributed theoretical results are deferred to the technical appendix of the extended version of this paper at https://arxiv.org/abs/2112.04583

The Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI-23)

Background and Notation

Let us summarize the notation and background necessary for the subsequent development.

Markov Networks (MNs)

An undirected graph G = (V, E) consists of n = |V| vertices, connected via edges (v, w) ∈ E. For two graphs G1, G2, we write V(G1) and V(G2) to denote the vertices of G1 and G2, respectively, and similarly E(G1) and E(G2) for the edges. A clique C is a fully-connected subset of vertices, i.e., ∀v, w ∈ C : (v, w) ∈ E. The set of all cliques of G is denoted by C(G). Here, any undirected graph represents the conditional independence structure of an undirected graphical model or MN (Wainwright and Jordan 2008). To this end, we identify each vertex v ∈ V with a random variable X_v taking values in the state space X_v = Dom(X_v). The random vector X = (X_v : v ∈ V), with probability mass function (pmf) P, represents the random joint state of all vertices in some arbitrary but fixed order, taking values x in the Cartesian product space X = Dom(X) = ⨉_{v∈V} X_v.
If not stated otherwise, X is a discrete set. Moreover, we allow access to these quantities for any proper subset of variables S ⊂ V, i.e., X_S = (X_v : v ∈ S), x_S, and X_S = ⨉_{v∈S} X_v, respectively. We write ω(G) to indicate the treewidth of G, i.e., ω(G) = max_{C∈C(G)} |C| − 1.

According to the Hammersley-Clifford theorem (Hammersley and Clifford 1971), the probability mass of X factorizes over positive functions ψ_C : X_C → R_+, one for each maximal clique of the underlying graph,

P(X = x) = (1/Z) ∏_{C∈C} ψ_C(x_C),    (2)

normalized via Z = Σ_{x∈X} ∏_{C∈C} ψ_C(x_C). Due to the positivity of ψ_C, it can be written as an exponential, i.e., ψ_C(x_C) = exp(⟨θ_C, φ_C(x_C)⟩) with sufficient statistic φ_C : X_C → R^{|X_C|}. The overcomplete sufficient statistic of discrete data is a one-hot vector that selects a specific weight value, e.g., ψ_C(x_C) = exp(θ_{C=x_C}). The full joint can be written in the famous exponential family form P(X = x) = exp(⟨θ, φ(x)⟩ − log Z) with θ = (θ_C : C ∈ C) and φ(x) = (φ_C(x_C) : C ∈ C). The parameters of exponential family members are estimated by minimizing the negative average log-likelihood ℓ(θ; D) = −(1/|D|) Σ_{x∈D} log P_θ(x) for some data set D via first-order numeric optimization methods. D contains samples from X, and it can be shown that the estimated probability mass converges to the data-generating distribution as the size of D increases. However, computing Z, and hence performing probabilistic inference, is #P-hard (Valiant 1979; Bulatov and Grohe 2004). There are approximation techniques for inference with quality guarantees (Piatkowski and Morik 2018), but for exact inference, the junction tree algorithm is needed. The junction tree representation of an undirected model is a tree in which each vertex represents a maximal clique of a triangulation¹ of G (Wainwright and Jordan 2008, Sec. 2.5.2). The cutset of each pair of adjacent clique-vertices is called a separator. Nevertheless, junction trees require the underlying graphical structure of the graphical model to be decomposable.

¹A triangulation of a graph G = (V, E) is another graph G′ = (V, E′) with E ⊆ E′, such that G′ is a chordal graph.

Figure 1: Illustration of the Junction Tree Algorithm.

Decomposable Models (DMs)

A DM, P_G, is a MN where the underlying conditional independence structure, G, is a chordal graph². DMs can be translated directly into an equivalent junction tree representation by finding the maximum spanning tree of their clique graph. Each vertex of the clique graph is a maximal clique in the DM, and each edge is the separator between two such vertices. The weight of each edge is the number of variables in the corresponding separator. The resulting junction tree, T = (C, S), has vertices that are the maximal cliques, C, and edges that are the minimal separators, S, of the DM. Besides allowing for fast inference, another benefit of a DM is that there is a closed-form solution to the maximum likelihood parameter estimation problem for the joint distribution over all the variables in the model (Haberman 1977). The joint distribution of the DM P_G is therefore:

P_G(x) = ∏_{C∈C} P_C(x_C) / ∏_{S∈S} P_S(x_S)

where P_d(·) represents the marginal probability over X_d. Alternatively, we can also represent the joint distribution of P_G as a product of conditional probability tables (CPTs) if we choose a maximal clique in C to be the root node of P_G's junction tree T:

P_G(x) = ∏_{C∈C} P_{C∖pa(C)|pa(C)}(x_{C∖pa(C)} | x_{pa(C)}) = ∏_{C∈C} P^T_C(x)

where P^T_C(x) = P_{C∖pa(C)|pa(C)}(x_{C∖pa(C)} | x_{pa(C)}) and pa(C) is the parent clique of C in the junction tree T, with pa(C) = ∅ when C is the assigned root node of T.

Junction Tree Algorithms (JTAs)

Recall the partition function of a general MN: Z = Σ_{x∈X} ∏_{C∈C} ψ_C(x_C).
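The clique-graph construction just described for DMs can be sketched in a few lines: connect overlapping maximal cliques with edges weighted by separator size, then take a maximum-weight spanning tree. The sketch below uses Kruskal's algorithm with a small union-find; the chordal graph and its maximal cliques are a hypothetical example.

```python
from itertools import combinations

# Maximal cliques of a (hypothetical) chordal graph over {a, b, c, d, e}.
cliques = [frozenset("abc"), frozenset("bcd"), frozenset("de")]

# Clique-graph edges between overlapping cliques, heaviest separators first.
edges = sorted(((C1, C2) for C1, C2 in combinations(cliques, 2) if C1 & C2),
               key=lambda e: len(e[0] & e[1]), reverse=True)

parent = {C: C for C in cliques}            # union-find roots for Kruskal

def find(C):                                # follow parents up to the root
    while parent[C] != C:
        C = parent[C]
    return C

junction_tree = []                          # edges (clique, clique, separator)
for C1, C2 in edges:
    r1, r2 = find(C1), find(C2)
    if r1 != r2:                            # keep the edge if it joins components
        parent[r1] = r2
        junction_tree.append((C1, C2, C1 & C2))
```

Here the resulting tree joins {a,b,c} to {b,c,d} over separator {b,c}, and {b,c,d} to {d,e} over separator {d}, exactly the (C, S) structure described above.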
Evaluating the partition function of loopy models exactly does not necessarily require a naive summation over the state space X; there is another, more efficient, technique. Any loopy graph can be triangulated and converted into a chordal graph with a junction tree representation (Lauritzen and Spiegelhalter 1988; Wainwright and Jordan 2008). Then, as illustrated in Figure 1, the junction tree algorithm goes a step further and computes the un-normalized marginal probability, β, for each maximal clique in the junction tree (Koller and Friedman 2009, Corollary 10.2):

∀C ∈ C : β_C(x_C) = Σ_{x_{V∖C}∈X_{V∖C}} ∏_{C′∈C} ψ_{C′}(x_{V∖C}, x_C)

As with belief propagation in ordinary trees, inference on the junction tree has a time complexity that is polynomial in the maximal state space size of its vertices. The maximal vertex state space size of a junction tree is, however, exponential in the treewidth of the triangulation of G used. Hence, if the treewidth of the triangulation of a loopy model is small, exact inference via the junction tree algorithm is rather efficient. Choosing a triangulation that results in a minimal treewidth is an NP-hard problem, but a valid triangulation can be found with time and memory complexity linear in the number of vertices (Dechter 2003; Berry et al. 2004; Heggernes 2006). See Figure 2 for an example of a non-minimal and a minimal triangulation of a graph.

²A graph is chordal if every induced cycle has exactly 3 vertices.

Figure 2: (a) a non-chordal MN, (b) a possible triangulation of this MN, (c) a minimal triangulation of this MN.

αβ-Divergence between DMs

A divergence is a measure of the difference between two probability distributions. More formally, a divergence is a function between two distributions as defined in Definition 2.

Definition 2 (Divergence). Suppose S is the set of probability distributions with the same support. A divergence, D, is a function D(· || ·) : S × S → R such that ∀P, Q ∈ S : D(P || Q) ≥ 0, and P = Q ⇔ D(P || Q) = 0.³
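Returning briefly to the junction tree algorithm: the un-normalized clique marginals β_C it produces can be verified by brute force on a model small enough to enumerate. The chain MN and factor values below are hypothetical; the point is that every clique's β table sums to the same partition function Z, which is what makes the beliefs usable as marginals after normalization.

```python
from itertools import product

# Chain MN a - b - c with maximal cliques {a,b} and {b,c}.
# Hypothetical factor values.
psi_ab = {(0,0): 1.0, (0,1): 2.0, (1,0): 0.5, (1,1): 1.5}
psi_bc = {(0,0): 3.0, (0,1): 1.0, (1,0): 2.0, (1,1): 4.0}

def weight(a, b, c):                    # un-normalized joint: product of factors
    return psi_ab[a, b] * psi_bc[b, c]

# Un-normalized clique marginals, here by brute-force summation; the JTA
# computes the same tables by message passing instead.
beta_ab = {(a, b): sum(weight(a, b, c) for c in (0, 1))
           for a, b in product((0, 1), repeat=2)}
beta_bc = {(b, c): sum(weight(a, b, c) for a in (0, 1))
           for b, c in product((0, 1), repeat=2)}

# Every clique's belief sums to the same partition function Z.
Z = sum(beta_ab.values())
assert abs(Z - sum(beta_bc.values())) < 1e-12
```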
Furthermore, there are generalized divergences for which common divergences, such as the KL divergence, arise as special cases. Specifically, we will use the generalized divergence known as the αβ-divergence (Cichocki, Cruces, and Amari 2011).

Definition 3 (αβ-divergence). The αβ-divergence, D_AB, between two positive measures P and Q is defined by the following, where α and β are parameters:

D^{α,β}_AB(P, Q) = Σ_{x∈X} d^{(α,β)}_AB(P(x), Q(x))    (3)

where (Cichocki, Cruces, and Amari 2011):

d^{(α,β)}_AB(P(x), Q(x)) =    (4)
  −(1/(αβ)) ( P(x)^α Q(x)^β − (α/(α+β)) P(x)^{α+β} − (β/(α+β)) Q(x)^{α+β} )   for α, β, α+β ≠ 0
  (1/α²) ( P(x)^α log(P(x)^α / Q(x)^α) − P(x)^α + Q(x)^α )   for α ≠ 0, β = 0
  (1/α²) ( log(Q(x)^α / P(x)^α) + P(x)^α / Q(x)^α − 1 )   for α = −β ≠ 0
  (1/β²) ( Q(x)^β log(Q(x)^β / P(x)^β) − Q(x)^β + P(x)^β )   for α = 0, β ≠ 0
  (1/2) ( log P(x) − log Q(x) )²   for α = β = 0.

³Some authors also require that the quadratic part of the Taylor expansion of D(p, p+dp) define a Riemannian metric on S (Amari 2016). However, this requirement is not needed by the methods described in this paper and is therefore left out.

The parameters α and β of the αβ-divergence can be used to express other commonly used divergences. Specifically, α = 1, β = 0 gives the KL divergence, while α = 0.5, β = 0.5 gives the Bhattacharyya coefficient, which immediately gives the Hellinger distance. The αβ-divergence in Equation 4 can be expressed as a linear combination of three smaller functionals.

Theorem 1. The 5 cases of the αβ-divergence in Equation 4 are linear combinations of the following 3 functionals:

f1(P, Q) = Σ_{x∈X} (1/2)(log P(x) − log Q(x))²
f2(P, Q; a, b) = Σ_{x∈X} P(x)^a Q(x)^b
f3(P, Q; a, b, c, d) = Σ_{x∈X} P(x)^a Q(x)^b log(P(x)^c Q(x)^d)

Therefore, the ability to tractably compute these functionals between two DMs implies the ability to tractably compute the αβ-divergence between two DMs. Here we assume that a complexity exponential in the treewidth of our DMs is tractable.

Theorem 2.
The time complexity of computing the functional f1 directly between two DMs is O(n² ω 2^{ω+1}), where ω(G) is the treewidth of chordal graph G, ω = max(ω(G_P), ω(G_Q)), and n is the number of variables.

Since computing f1 directly is tractable, the focus of the rest of this paper will be to show how to compute the functionals f2 and f3 between two DMs. In order to simplify further exposition, it is convenient to express f2 and f3 through a single, more general, functional.

Theorem 3. f2 can be expressed by the functional F.

Theorem 4. f3 can be expressed by the functional F.

Therefore, any method that can tractably compute F, as defined in Definition 1, between two DMs can also tractably compute the αβ-divergence between these models. With the reasoning for the definition of F established, we can now substitute the maximum likelihood estimators of the DMs P_{G_P} and Q_{G_Q} into the functional F. But before we start, first recall the notation established above:

P(x) = ∏_{C∈C} P_{C∖pa(C)|pa(C)}(x_{C∖pa(C)} | x_{pa(C)}) = ∏_{C∈C} P^T_C(x_C)

where P^T_C(x) = P^T_C(x_C), i.e. each CPT depends only on the variables in its clique, and pa(C) is the parent of the maximal clique C in the junction tree of P's and Q's respective chordal graphs. Then, continuing with the substitution, we get:

F(P, Q; g, h, g′, h′)    (5)
  = Σ_{x∈X} g[P](x) h[Q](x) L(g′[P](x) h′[Q](x))
  = Σ_{x∈X} ( Σ_{C∈C(G_P)} L(g′[P^T_C](x)) + Σ_{C∈C(G_Q)} L(h′[Q^T_C](x)) ) g[P](x) h[Q](x)
  = Σ_{C∈C(G_P)} Σ_{x_C∈X_C} L(g′[P^T_C](x_C)) SP_C(x_C) + Σ_{C∈C(G_Q)} Σ_{x_C∈X_C} L(h′[Q^T_C](x_C)) SP_C(x_C)

where, for ease of notation:

SP_C(x_C) = Σ_{x∈X_{V∖C}} g[P](x_C, x) h[Q](x_C, x) = Σ_{x∈X_{V∖C}} ∏_{C′∈C_P} g[P^T_{C′}](x_C, x) ∏_{C′∈C_Q} h[Q^T_{C′}](x_C, x)    (6)

which represents the marginalisation, over all the variables not in the clique C, of all the non-log factors produced by F. The equality in Equation 5 holds mainly due to the associativity of summation.

Remark 1. The lower-bound complexity of directly computing Equation 5 is Ω(2^n), where n is the number of variables. Therefore, directly computing the functional F between two DMs is intractable.
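The equality in Equation 5 can be sanity-checked by brute force on models small enough to enumerate. The sketch below does this for f3, so L = log, with P^c and Q^d as the log factors and P^a Q^b as the non-log factors; the chain structure and all CPT values are hypothetical.

```python
import math
from itertools import product

# P and Q are chain DMs over binary x1, x2, x3 with cliques {1,2}, {2,3},
# written in junction-tree CPT form P(x) = P(x1,x2) * P(x3|x2).
# All table values are hypothetical.
states = list(product((0, 1), repeat=3))
P12  = {(0,0): 0.3, (0,1): 0.2, (1,0): 0.1, (1,1): 0.4}
P3g2 = {(0,0): 0.7, (0,1): 0.3, (1,0): 0.4, (1,1): 0.6}   # key: (x2, x3)
Q12  = {(0,0): 0.25,(0,1): 0.25,(1,0): 0.2, (1,1): 0.3}
Q3g2 = {(0,0): 0.5, (0,1): 0.5, (1,0): 0.8, (1,1): 0.2}

P = lambda x: P12[x[0], x[1]] * P3g2[x[1], x[2]]
Q = lambda x: Q12[x[0], x[1]] * Q3g2[x[1], x[2]]
a, b, c, d = 1.0, 0.5, 1.0, -1.0

# f3 computed directly over the full state space ...
direct = sum(P(x)**a * Q(x)**b * math.log(P(x)**c * Q(x)**d) for x in states)

# ... and via Equation 5: the log splits over the CPT factors, while the
# non-log product P^a Q^b is marginalised onto each clique (the SP terms).
w = {x: P(x)**a * Q(x)**b for x in states}
SP12 = {(i, j): sum(w[i, j, k] for k in (0, 1))
        for i, j in product((0, 1), repeat=2)}
SP23 = {(j, k): sum(w[i, j, k] for i in (0, 1))
        for j, k in product((0, 1), repeat=2)}

decomposed = (sum(math.log(P12[s]**c)  * SP12[s] for s in SP12)
            + sum(math.log(P3g2[s]**c) * SP23[s] for s in SP23)
            + sum(math.log(Q12[s]**d)  * SP12[s] for s in SP12)
            + sum(math.log(Q3g2[s]**d) * SP23[s] for s in SP23))
assert abs(direct - decomposed) < 1e-12
```

Enumerating w(x) still costs Ω(2^n) here, which is exactly the cost the method developed next avoids by computing the SP tables via the junction tree algorithm.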
Consequently, in order to compute F(P, Q), and therefore D_AB(P, Q), while avoiding complexity exponential in n, we require a more sophisticated method for its computation.

Computing Functional F between DMs

In order to tractably compute F between DMs P_{G_P} and Q_{G_Q}, and therefore the αβ-divergence, we first require the notion of a computation graph for DMs P_{G_P} and Q_{G_Q}.

Definition 4 (strictly larger, clique mapping α). A chordal graph H is strictly larger than chordal graphs G_P and G_Q if every maximal clique in either chordal graph is a subset of, or equal to, a maximal clique in H. In other words, H is strictly larger than G_P and G_Q if and only if there exists a mapping α such that:

α : C(G_P, G_Q) → C(H)   s.t.   ∀C ∈ C(G_P, G_Q) : C ⊆ α(C)

where C(G_P, G_Q) = C(G_P) ∪ C(G_Q) is the set of maximal cliques in chordal graphs G_P and G_Q.

Figure 3: Junction Tree Algorithm used to compute the functional F between two DMs P_{G_P} and Q_{G_Q} using the computation graph H, assuming H is a connected graph.

Definition 5 (computation graph). If a chordal graph, H, is strictly larger than chordal graphs G_P and G_Q, then H is a computation graph of DMs P_{G_P} and Q_{G_Q}.

We can obtain a computation graph H by first taking the graph union of G_P and G_Q, and then triangulating G_P ∪ G_Q. For the rest of this section, we will provide details on how our method, Junction Forest Computation (JFComp), uses the junction tree algorithm to compute the functional F between two DMs. We will first describe how JFComp works on a connected computation graph H, before generalising to the case where H is a disconnected graph.

Junction Tree Computation (JTComp)

Recall that we want to compute F(P, Q) as expressed in Equation 5. Observe that the two nested sums in F(P, Q) have innermost sums over X_C of similar form, but over different sets of maximal cliques, C(G_P) and C(G_Q) respectively.
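The construction of H described above (graph union, then triangulation) can be sketched with the standard vertex-elimination procedure: repeatedly pick a vertex, connect its remaining neighbours, and discard it; the fill-in edges yield a valid, though not necessarily minimal, triangulation. The graphs and the min-degree elimination heuristic below are illustrative choices, not part of the method's specification.

```python
def triangulate(adj):
    """Eliminate vertices (min-degree order); return the filled graph."""
    adj  = {v: set(ns) for v, ns in adj.items()}   # working copy
    fill = {v: set(ns) for v, ns in adj.items()}   # original + fill-in edges
    remaining = set(adj)
    while remaining:
        v = min(remaining, key=lambda u: len(adj[u] & remaining))
        nbrs = adj[v] & remaining
        for a in nbrs:                             # make v's neighbourhood a clique
            for b in nbrs:
                if a != b:
                    adj[a].add(b)
                    fill[a].add(b)
        remaining.discard(v)
    return fill

# G_P is the chain 1-2-3-4; G_Q contributes the edge 1-4.
GP = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
GQ = {1: {4}, 2: set(), 3: set(), 4: {1}}
union = {v: GP[v] | GQ[v] for v in GP}             # a four-cycle: not chordal
H = triangulate(union)                             # chordal computation graph
```

Here the union of the chain and the extra edge is a four-cycle, and elimination adds a single chord, making H chordal.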
Therefore, we want to re-express F(P, Q) such that the two sums over X_C are over the same set of maximal cliques, C(H).

Theorem 5. Assume we have two DMs P_{G_P} and Q_{G_Q} and a computation graph H for both models. By Definition 4, we also have a mapping α from maximal cliques in P_{G_P} and Q_{G_Q} to maximal cliques in H. Then the following equivalence holds for D ∈ {P, Q}:

Σ_{x_C∈X_C} L(g′[D^T_C](x_C)) SP_C(x_C) = Σ_{x_{α(C)}∈X_{α(C)}} L(g′[D^T_C](x_{α(C)})) SP_{α(C)}(x_{α(C)})    (7)

Figure 4: Differences in obtaining the clique beliefs over each maximal clique of the computation graph H between a connected and a disconnected computation graph.

With that, we will now show that the computation of SP_{α(C)}, ∀C ∈ C(H), is equivalent to running the junction tree algorithm over the junction tree of H with a specific set of initial factors. Figure 3 shows an illustration of this procedure.

Theorem 6. Given two DMs, P_{G_P} and Q_{G_Q}, and a computation graph of both models, H, let Ψ be the set of factors defined as follows:

Ψ := { g[P^T_{C_P}] : C_P ∈ C(G_P) } ∪ { h[Q^T_{C_Q}] : C_Q ∈ C(G_Q) }

After running the junction tree algorithm over the junction tree of H with factors Ψ, we obtain the following beliefs over each maximal clique in H:

∀C ∈ C(H) : β_C(x_C) = SP_C(x_C)

Therefore, using the junction tree algorithm, we obtain beliefs over each maximal clique in H that we can substitute for SP in Equation 5. However, this assumes that H is a connected graph, which might not always be the case.

Junction Forest Computation (JFComp)

We now show how JFComp extends to handle cases where H is disconnected. When the computation graph H is disconnected, H can be represented as a list of chordal graphs, H = {H_i}. Therefore, we also have a list of clique trees, T = {T_i}, one for each chordal graph in H.
Since H is still strictly larger than the chordal graph structures of P_{G_P} and Q_{G_Q}, by Definition 4 there is still a mapping α from C(G_P) ∪ C(G_Q) to C(H). However, since the chordal graph H now comprises multiple chordal graphs, and therefore multiple clique trees, we are unable to apply Theorem 6 directly to compute SP from Equation 6. The reason is that there is no single clique tree to run the junction tree algorithm on, so factors from different clique trees are unable to propagate to each other. Instead, we show in Theorem 7 that having a disconnected computation graph H, and therefore a set of clique trees, T, which are disconnected from each other, essentially breaks SP up into smaller sub-problems over each clique tree in T. The results of these sub-problems can then be combined via multiplication to compute SP. An illustration of the result of Theorem 7, and of how it differs from computing SP on a connected H, can be found in Figure 4.

Definition 6 (τ, clique to clique tree mapping). Let τ be a mapping from the maximal cliques of P_{G_P} and Q_{G_Q} to the clique tree in T that contains the maximal clique given by the clique mapping, α, from Definition 4:

τ : C(G_P) ∪ C(G_Q) → T   s.t.   ∀C ∈ C(G_P) ∪ C(G_Q) : α(C) ∈ C(τ(C))

Theorem 7. If the computation graph H for DMs P_{G_P} and Q_{G_Q} is disconnected, then SP_{α(C)} in Equation 7 can be re-expressed as follows:

SP_{α(C)}(x_{α(C)}) = β_{α(C)}(x_{α(C)}) ∏_{T_i∈T∖{τ(C)}} Σ_{x∈X_{C(T_i)}} β_{C(T_i)}(x) = β_{α(C)}(x_{α(C)}) R_{τ(C)}

where C(T_i) denotes any clique in the set of maximal cliques of clique tree T_i, and R_{τ(C)} ∈ R. We can then obtain the required beliefs by running the junction tree algorithm for each junction tree in T separately. Therefore, even if the computation graph H is disconnected, we can compute SP and thus F(P, Q).
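The per-tree constant R_{τ(C)} in Theorem 7 is distributivity at work: factors living in a different connected component sum out to a scalar. A tiny brute-force check, with hypothetical factor tables, for an H with two components:

```python
# Component 1 holds a factor over (x1, x2); component 2 holds a factor
# over x3. All values are hypothetical.
w1 = {(0,0): 0.2, (0,1): 0.4, (1,0): 0.1, (1,1): 0.3}
w2 = {0: 1.5, 1: 2.5}

# Marginalising the full product onto the clique {x1, x2} ...
SP = {s: sum(w1[s] * w2[k] for k in (0, 1)) for s in w1}

# ... equals the local belief w1 times the scalar R = sum over x3 of w2.
R = sum(w2.values())
assert all(abs(SP[s] - w1[s] * R) < 1e-12 for s in w1)
```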
Substituting these beliefs back into Equation 5, we get:

F(P, Q)    (8)
  = Σ_{C∈C(G_P)} R_{τ(C)} Σ_{x_{α(C)}∈X_{α(C)}} L(g′[P^T_C](x_{α(C)})) β_{α(C)}(x_{α(C)})
  + Σ_{C∈C(G_Q)} R_{τ(C)} Σ_{x_{α(C)}∈X_{α(C)}} L(h′[Q^T_C](x_{α(C)})) β_{α(C)}(x_{α(C)})

The next section shows that the complexity of computing the expression in Equation 8, and in general the complexity of JFComp, is lower than that of computing F directly.

Computational Complexity

We can determine the computational complexity of computing the αβ-divergence between two DMs by first checking the given values of α and β. This step takes O(1) time. When α = β = 0, from Theorem 2 we know that the complexity of computing D^{0,0}_AB(P || Q) is:

D^{0,0}_AB(P, Q) ∈ O(n² ω 2^{ω+1})

where ω(G) is the treewidth of chordal graph G and ω = max(ω(G_P), ω(G_Q)). When α and β take values other than 0, we require JFComp to compute parameterisations of F between P_{G_P} and Q_{G_Q}. In general, the αβ-divergence is a linear combination of different parameterisations of F. Therefore, in big-O notation, the complexity of computing the αβ-divergence is equivalent to that of computing F. As such, for the remainder of this section, we discuss the overall complexity of JFComp for computing F between two DMs.

The first step of JFComp involves assigning factors constructed from the CPTs over C(G_P) and C(G_Q) to C(H). For each factor ψ, and therefore for each C ∈ C(G_P) ∪ C(G_Q), we need to search through C(H) to find a suitable clique to assign ψ to. This results in the complexity:

O(|C(G_P)| · |C(H)|) + O(|C(G_Q)| · |C(H)|) ⊆ O(n²)

since the number of maximal cliques in any chordal graph is bounded by the number of vertices in the graph. Once all factors ψ have been assigned to a maximal clique in H, we then run the junction tree algorithm to calibrate the clique tree(s) of H with these factors.
The complexity of this is:

O(|C(H)| · 2^{ω(H)+1}) ⊆ O(n · 2^{ω(H)+1})

Once the clique tree/forest is calibrated and we know β_{α(C)} for all C ∈ C(G_P) ∪ C(G_Q), we can then evaluate the two sums in Equation 8, with complexity:

O(|C(G_P)| · 2^{ω(H)+1}) + O(|C(G_Q)| · 2^{ω(H)+1}) ⊆ O(n · 2^{ω(H)+1})

Adding up the computational complexity of each step of JFComp gives the final complexity of computing the functional F between two DMs P_{G_P} and Q_{G_Q}:

O(n²) + O(n · 2^{ω(H)+1}) + O(n · 2^{ω(H)+1}) ⊆ O(n · 2^{ω(H)+1})

which is more efficient than Ω(2^n), the complexity of computing F directly. Therefore, the computational complexity of computing the αβ-divergence between P_{G_P} and Q_{G_Q} is:

D^{(α,β)}_AB(P || Q) ∈ O(n² ω 2^{ω+1}) for α = β = 0, and O(n · 2^{ω(H)+1}) otherwise.

Runtime Comparison with mcgo

Recall that a method already exists for computing the KL divergence between two BNs (Moral, Cano, and Gómez-Olmedo 2021), which we will refer to as mcgo. Also note that it is possible to take a distribution represented by a BN and, in exchange for some loss of independence information, represent it using a DM instead (Koller and Friedman 2009, p. 134). Therefore, one might ask how the practical runtime of JFComp compares to mcgo when computing the KL divergence between two BNs. To answer this question, we replicate the experiment used by Moral, Cano, and Gómez-Olmedo. They chose a set of BNs from the bnlearn (Scutari 2010) repository (https://www.bnlearn.com/bnrepository/) to sample from, and estimated a second BN from these samples.
The authors have provided these estimated BNs, for each of the BNs from bnlearn used in their experiments, at https://github.com/mgomez-olmedo/KL-pgmpy, and we use this set of BNs from their repository in our own experiments. Now that we have multiple pairs of BNs, one original and one estimated from samples, we compute the KL divergence between each BN pair using both mcgo and JFComp. We repeat this 10 times in order to estimate both methods' runtimes in seconds. We do not include the conversion of these BNs into DMs in the final runtime. We run the experiments on an Intel NUC-10i7FNH with 64GB of RAM. Both methods are implemented in Python and use the pgmpy library (Ankan and Panda 2015). The implementation of JFComp can be found at https://lklee.dev/pub/2023-aaai/code

Table 1: Mean runtimes in seconds, and their standard deviations, for mcgo and JFComp when computing the KL divergence between two BNs. Lower is better.

Network      mcgo mean  mcgo sd  JFComp mean  JFComp sd
cancer       0.0117     0.0026   0.0132       0.0033
earthquake   0.0104     0.0025   0.0075       0.0009
survey       0.0140     0.0032   0.0081       0.0002
asia         0.0163     0.0001   0.0137       0.0007
sachs        0.0464     0.0106   0.0151       0.0001
child        0.0778     0.0101   0.0402       0.0013
insurance    0.3838     0.0051   0.1590       0.0029
water        6.9326     0.0329   7.6454       0.0637
mildew       19.326     0.1318   19.459       0.0852
alarm        0.3177     0.0099   0.0875       0.0018
hailfinder   0.8543     0.0243   0.1672       0.0052
hepar2       1.3058     0.0307   0.2403       0.0140
win95pts     1.0256     0.0289   0.3538       0.0049

From the results in Table 1, we observe that, despite mcgo containing numerous computational optimisations, our direct application of belief propagation has a practical runtime comparable to mcgo. Furthermore, on some networks JFComp is faster than mcgo, probably due to lower overhead and to better leveraging the optimized code in the pgmpy library for the bulk of the computation.
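Both divergences compared in the case study below, KL and Hellinger, are members of the αβ-divergence family of Definition 3. As a reference point, the piecewise form in Equation 4 can be implemented directly for explicitly tabulated pmfs. This is the brute-force O(|X|) sum that JFComp avoids; the branch for α = −β follows the form in Cichocki, Cruces, and Amari (2011), reconstructed here rather than copied from this paper.

```python
import math

def d_ab(p, q, a, b, eps=1e-12):
    """αβ-divergence between two discrete pmfs given as aligned sequences.

    Direct implementation of the piecewise definition (Cichocki, Cruces,
    and Amari 2011); a brute-force sum, not the junction-tree method."""
    s = 0.0
    for P, Q in zip(p, q):
        if abs(a) > eps and abs(b) > eps and abs(a + b) > eps:
            s += -(1.0/(a*b)) * (P**a * Q**b
                                 - a/(a+b) * P**(a+b)
                                 - b/(a+b) * Q**(a+b))
        elif abs(a) > eps and abs(b) <= eps:        # α ≠ 0, β = 0
            s += (1.0/a**2) * (P**a * math.log(P**a / Q**a) - P**a + Q**a)
        elif abs(a) <= eps and abs(b) > eps:        # α = 0, β ≠ 0
            s += (1.0/b**2) * (Q**b * math.log(Q**b / P**b) - Q**b + P**b)
        elif abs(a) <= eps and abs(b) <= eps:       # α = β = 0
            s += 0.5 * (math.log(P) - math.log(Q))**2
        else:                                       # α = -β ≠ 0
            s += (1.0/a**2) * (math.log(Q**a / P**a) + P**a / Q**a - 1.0)
    return s

# α = 1, β = 0 recovers the KL divergence between two (hypothetical) pmfs.
p, q = [0.2, 0.3, 0.5], [0.4, 0.4, 0.2]
kl = sum(pi * math.log(pi/qi) for pi, qi in zip(p, q))
assert abs(d_ab(p, q, 1.0, 0.0) - kl) < 1e-9
```

For α = β = 0.5, d_ab returns 4(1 − Σ_x √(P(x)Q(x))), i.e. four times the squared Hellinger distance.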
Case Study in Model Selection

Although allowing for a simpler implementation that can leverage existing library implementations of the junction tree algorithm for most of the computation is a satisfactory result in itself, recall that the original motivation of JFComp is to compute a wider range of divergences between graphical models. Therefore, in order to motivate the need for divergences other than the KL divergence, we now present a case study on the application of computing divergences between BNs to the problem of model selection, a problem to which the KL divergence is normally well suited.

Consider a scientist who, in an attempt to model a natural phenomenon that they have samples from, constructs two candidate BNs, A and B. They then wish to determine, using the samples, which candidate model is a better representation of the phenomenon they wish to model. One way to do this is to estimate a new BN, E, from the samples and compute the divergence between E and the candidate models. In order to recreate this scenario synthetically, we use the BN sachs from the bnlearn repository (Scutari 2010) as the phenomenon the scientist wishes to model. The scientist's candidate models are then constructed by removing edges from sachs and marginalising the CPTs according to (Choi, Chan, and Darwiche 2005).

Table 2: Divergence between the candidate models and a Bayesian network estimated from randomly sampled datasets of size 10000. Lower numbers indicate a better fit.

run   KL(A || E)   KL(B || E)   Hellinger(A, E)   Hellinger(B, E)
1     0.4021       0.5169       0.3027            0.2915
2     0.3993       0.5182       0.3009            0.2895
3     0.3979       0.5234       0.3014            0.2906
4     0.4018       0.5219       0.3022            0.2904
5     0.3996       0.5275       0.3018            0.2908

Table 3: Divergence between the candidate models and the original Bayesian network sachs. Lower numbers indicate a better fit.

                    sachs || A   sachs || B
Kullback-Leibler    0.3687       0.3090
Hellinger           0.3013       0.2921
Further details regarding the construction of A and B can be found in Appendix I of the extended version of this paper. Sampling 100000 samples from sachs, we learn the BN E from these samples, using the constraint-based structure learner in pgmpy and maximum likelihood estimation with Laplace smoothing for learning the parameters of E. The use of a smoothing technique ensures that the KL divergence is defined. We then compute the Hellinger and KL divergences between the candidate models and the estimated model: D(A || E) and D(B || E). We repeat the experiment 5 times, with different random samples from sachs.

From the results in Table 2, we observe that the KL divergence indicates that A is the BN closest to E, and that the scientist should therefore choose A, while the Hellinger distance indicates the opposite, choosing B instead. Given this discrepancy, the question is which candidate model, A or B, is actually the closer approximation to the actual phenomenon, and therefore which divergence is "correct". Since, for the purpose of this case study, we already have the true model of the phenomenon being modelled, we can simply compute the divergence between our candidate models and sachs to get an answer. From Table 3, we observe that, when computing the divergence between sachs and the candidate models, both divergences agree that B is closer to sachs. Consequently, in our case study, the scientist would have chosen the incorrect model had they only used the KL divergence in their experiment. Of course, it might be possible to avoid such a scenario if a different smoothing technique were used to learn the parameters of E. However, the use of multiple divergences is still needed for the scientist to even become aware of possible issues with the smoothing technique in the first place.
In general, the main takeaway from this example is that one must not be over-reliant on a single divergence, and that the use of a wide array of divergences can help avoid mistakes in model selection.

Conclusion

In conclusion, we showed that computing the functional F, and therefore the αβ-divergence, between two DMs is equivalent to belief propagation on a junction tree/forest with a specific set of initial factors, defined based on how the MLE of DMs decomposes the functional F. The result is a method with complexity exponential in the treewidth of the computation graph H of these models. Therefore, the proposed method is more efficient than computing F between the DMs directly, unless H is a fully saturated graph. One advantage of JFComp is that it can be easily implemented in any environment that has a pre-existing implementation of the junction tree algorithm. Furthermore, since JFComp can compute the general functional F between two DMs, it can compute, or approximate, other divergences and functionals, not just the αβ-divergence.

However, recall that in order to obtain the computation graph H, we take the graph union of the two DMs we wish to compute the divergence between, and triangulate the resulting graph union to form a chordal computation graph. One potential concern, therefore, is the possibility that the triangulation step produces a computation graph H with large treewidth. Since the complexity is exponential in the treewidth of H, this would make the exact value of F slow to compute. One avenue of further research is therefore to drop the requirement that the computation graph H be chordal, in exchange for an approximation of F instead of an exact computation. In principle, this can be done by not triangulating G_P ∪ G_Q, and instead running approximate inference algorithms on the graph G_P ∪ G_Q using the set of factors Ψ defined in Theorem 6.
Furthermore, throughout this work we only considered estimating the divergence between two discrete distributions. More work is therefore needed to investigate how one might extend this approach to divergence estimation for numeric or even mixed-type data. Additionally, in this work we were only concerned with computing the divergence between the joint distributions of two DMs. In practice, however, it is common to encounter situations where one wants to compute the divergence between two conditional distributions. More work is needed to investigate this particular problem, either by extending the current work or via a new method that draws inspiration from it.

Acknowledgments

This research has been funded by the Federal Ministry of Education and Research of Germany and the state of North-Rhine Westphalia as part of the Lamarr Institute for Machine Learning and Artificial Intelligence, LAMARR22B, by the Australian Research Council under award DP210100072, as well as by the Australian Government Research Training Program (RTP) Scholarship.

References

Abdullah, A.; Kumar, R.; McGregor, A.; Vassilvitskii, S.; and Venkatasubramanian, S. 2016. Sketching, Embedding and Dimensionality Reduction in Information Theoretic Spaces. In Artificial Intelligence and Statistics, 948–956.
Amari, S.-i. 2016. Information Geometry and Its Applications. Springer. ISBN 978-4-431-55978-8.
Ankan, A.; and Panda, A. 2015. pgmpy: Probabilistic Graphical Models Using Python. Proceedings of the 14th Python in Science Conference, 6–11.
Barz, B.; Rodner, E.; Garcia, Y. G.; and Denzler, J. 2019. Detecting Regions of Maximal Divergence for Spatio-Temporal Anomaly Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(5): 1088–1101.
Berry, A.; Blair, J. R. S.; Heggernes, P.; and Peyton, B. W. 2004. Maximum Cardinality Search for Computing Minimal Triangulations of Graphs. Algorithmica, 39(4): 287–298.
Bhattacharya, A.; Kar, P.; and Pal, M. 2009.
On Low Distortion Embeddings of Statistical Distance Measures into Low Dimensional Spaces. In Bhowmick, S. S.; Küng, J.; and Wagner, R., eds., Database and Expert Systems Applications, Lecture Notes in Computer Science, 164–172. Berlin, Heidelberg: Springer. ISBN 978-3-642-03573-9.
Bulatov, A.; and Grohe, M. 2004. The Complexity of Partition Functions. In Automata, Languages and Programming, volume 3142 of Lecture Notes in Computer Science, 294–306. Heidelberg, Germany: Springer.
Chen, L.; Tao, C.; Zhang, R.; Henao, R.; and Carin, L. 2018. Variational Inference and Model Selection with Generalized Evidence Bounds. In International Conference on Machine Learning (ICML), 892–901.
Chen, Y.; Ye, J.; and Li, J. 2020. Aggregated Wasserstein Distance and State Registration for Hidden Markov Models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(9): 2133–2147.
Choi, A.; Chan, H.; and Darwiche, A. 2005. On Bayesian Network Approximation by Edge Deletion. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, UAI'05, 128–135. Arlington, Virginia, USA: AUAI Press. ISBN 978-0-9749039-1-0.
Cichocki, A.; Cruces, S.; and Amari, S.-i. 2011. Generalized Alpha-Beta Divergences and Their Application to Robust Nonnegative Matrix Factorization. Entropy, 13(1): 134–170.
Dechter, R. 2003. Constraint Processing. Elsevier Morgan Kaufmann. ISBN 978-1-55860-890-0.
Dieng, A. B.; Tran, D.; Ranganath, R.; Paisley, J. W.; and Blei, D. M. 2017. Variational Inference via χ Upper Bound Minimization. In Advances in Neural Information Processing Systems, 2732–2741.
Genevay, A.; Peyré, G.; and Cuturi, M. 2018. Learning Generative Models with Sinkhorn Divergences. In Storkey, A.; and Pérez-Cruz, F., eds., International Conference on Artificial Intelligence and Statistics (AISTATS), volume 84 of Proceedings of Machine Learning Research, 1608–1617. PMLR.
Haberman, S. J. 1977. The Analysis of Frequency Data. University of Chicago Press.
ISBN 978-0-226-31185-2.
Hammersley, J. M.; and Clifford, P. 1971. Markov fields on finite graphs and lattices. Unpublished manuscript.
Heggernes, P. 2006. Minimal triangulations of graphs: A survey. Discrete Mathematics, 306(3): 297–317.
Koller, D.; and Friedman, N. 2009. Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. The MIT Press. ISBN 978-0-262-01319-2.
Labeau, M.; and Cohen, S. B. 2019. Experimenting with Power Divergences for Language Modeling. In Inui, K.; Jiang, J.; Ng, V.; and Wan, X., eds., Empirical Methods in Natural Language Processing (EMNLP), 4102–4112. Association for Computational Linguistics.
Lauritzen, S. L.; and Spiegelhalter, D. J. 1988. Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems. Journal of the Royal Statistical Society. Series B (Methodological), 50(2): 157–224.
Li, Y.; and Turner, R. E. 2016. Rényi Divergence Variational Inference. In Advances in Neural Information Processing Systems, 1073–1081.
Moral, S.; Cano, A.; and Gómez-Olmedo, M. 2021. Computation of Kullback–Leibler Divergence in Bayesian Networks. Entropy, 23(9): 1122.
Nowozin, S.; Cseke, B.; and Tomioka, R. 2016. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. In Lee, D.; Sugiyama, M.; Luxburg, U.; Guyon, I.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc.
Piatkowski, N.; Lee, S.; and Morik, K. 2013. Spatio-Temporal Random Fields: Compressible Representation and Distributed Estimation. Machine Learning, 93(1): 115–139.
Piatkowski, N.; and Morik, K. 2018. Fast Stochastic Quadrature for Approximate Maximum-Likelihood Estimation. In Conference on Uncertainty in Artificial Intelligence (UAI), 715–724.
Schlimmer, J. C.; and Granger, R. H. 1986. Incremental Learning from Noisy Data. Machine Learning, 1(3): 317–354.
Scutari, M. 2010.
Learning Bayesian Networks with the bnlearn R Package. Journal of Statistical Software, 35: 1–22.
Valiant, L. G. 1979. The complexity of enumeration and reliability problems. SIAM Journal on Computing, 8(3): 410–421.
Wainwright, M. J.; and Jordan, M. I. 2008. Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends in Machine Learning, 1(1–2): 1–305.
Webb, G. I.; Lee, L. K.; Goethals, B.; and Petitjean, F. 2018. Analyzing Concept Drift and Shift from Sample Data. Data Mining and Knowledge Discovery, 32(5): 1179–1199.