# Synthesis of MCMC and Belief Propagation

Sungsoo Ahn, Michael Chertkov¹ ², Jinwoo Shin

School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Korea
¹Theoretical Division, T-4 & Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, NM 87545, USA
²Skolkovo Institute of Science and Technology, 143026 Moscow, Russia
{sungsoo.ahn, jinwoos}@kaist.ac.kr, chertkov@lanl.gov

## Abstract

Markov Chain Monte Carlo (MCMC) and Belief Propagation (BP) are the most popular algorithms for computational inference in Graphical Models (GMs). In principle, MCMC is an exact probabilistic method which, however, often suffers from exponentially slow mixing. In contrast, BP is a deterministic method which is typically fast and empirically very successful, but which in general lacks control of accuracy over loopy graphs. In this paper, we introduce MCMC algorithms correcting the approximation error of BP, i.e., we provide a way to compensate for BP errors via a consecutive BP-aware MCMC. Our framework is based on the Loop Calculus approach, which allows one to express the BP error as a sum of weighted generalized loops. Although the full series is computationally intractable, it is known that a truncated series, summing up all 2-regular loops, is computable in polynomial time for planar pairwise binary GMs, and it also provides a highly accurate approximation empirically. Motivated by this, we first propose a polynomial-time approximation MCMC scheme for the truncated series of general (non-planar) pairwise binary models. Our main idea here is to use the Worm algorithm, known to provide fast mixing in other (related) problems, and then design an appropriate rejection scheme to sample 2-regular loops. Furthermore, we also design an efficient rejection-free MCMC scheme for approximating the full series. The main novelty underlying our design is in utilizing the concept of cycle basis, which provides an efficient decomposition of the generalized loops. In essence, the proposed MCMC schemes run on a transformed GM built upon the non-trivial BP solution, and our experiments show that this synthesis of BP and MCMC outperforms both direct MCMC and bare BP schemes.

## 1 Introduction

GMs express factorization of joint multivariate probability distributions in statistics via a graph of relations between variables. The concept of GM has been used successfully in information theory, physics, artificial intelligence, and machine learning [1, 2, 3, 4, 5, 6]. Of the many inference problems one can pose for a GM, computing the partition function (normalization), or equivalently marginalizing the joint distribution, is the most general problem of interest. However, this paradigmatic inference problem is known to be computationally intractable in general, i.e., formally it is #P-hard even to approximate [7, 8]. To address this obstacle, extensive efforts have been made to develop practical approximation methods, among which MCMC-based [9] and BP-based [10] algorithms are, arguably, the most popular and practically successful ones. MCMC is exact, i.e., it converges to the correct answer, but its convergence/mixing time is, in general, exponential in the system size. On the other hand, message-passing implementations of BP typically demonstrate fast convergence, but in general lack approximation guarantees for GMs containing loops.
Motivated by this complementarity of the MCMC and BP approaches, we aim here to synthesize a hybrid approach benefiting from a joint use of MCMC and BP. At a high level, our proposed scheme uses BP as the first step and then runs MCMC to correct for the approximation error of BP. To design such an "error-correcting" MCMC, we utilize the Loop Calculus approach [11], which allows, in a nutshell, expressing the BP error as a sum (i.e., series) of weights of the so-called generalized loops (subgraphs of a special structure). There are several challenges one needs to overcome. First of all, to design an efficient Markov chain (MC) sampler, one needs a scheme which allows efficient transitions between the generalized loops. Second, even if one designs an MC which is capable of accessing all the generalized loops, it may mix slowly. Finally, weights of generalized loops can be positive or negative, while an individual MCMC run can only generate non-negative contributions.

Since approximating the full loop series (LS) is intractable in general, we first explore whether we can deal with these challenges at least in the case of the truncated LS corresponding to 2-regular loops. In fact, this problem has been analyzed in the case of planar pairwise binary GMs [12, 13], where it was shown that the 2-regular LS is computable exactly in polynomial time through a reduction to a Pfaffian (or determinant) computation [14]. In particular, the partition function of the Ising model without external field (i.e., where only pairwise factors are present) is computable exactly via the 2-regular LS. Furthermore, the authors show that in the case of general planar pairwise binary GMs, the 2-regular LS provides a highly accurate approximation empirically. Motivated by these results, we address the same question in the general (i.e., non-planar) case of pairwise binary GMs via MCMC. For the choice of MC, we adopt the Worm algorithm [15]. We prove that, with some modification including rejections, the algorithm allows sampling 2-regular loops (with probabilities proportional to their respective weights) in polynomial time. Then, we design a novel simulated annealing strategy using the sampler to estimate separately the positive and negative parts of the 2-regular LS. Given any ε > 0, this leads to an ε-approximation polynomial-time scheme for the 2-regular LS under a mild assumption.

We next turn to estimating the full LS. In this part, we ignore the theoretical question of establishing the polynomial mixing time of an MC and instead focus on designing an empirically efficient MCMC scheme. We design an MC using a cycle basis of the graph [16] to sample generalized loops directly, without rejections. It transits from one generalized loop to another by adding or deleting a random element of the cycle basis. Using this MC sampler, we design a simulated annealing strategy for estimating the full LS, similar to the one used earlier to estimate the 2-regular LS. Notice that even though the prime focus of this paper is on pairwise binary GMs, the proposed MCMC scheme allows straightforward generalization to general non-binary GMs. In summary, we propose novel MCMC schemes to estimate the LS correction to the BP estimate of the partition function.
Since already the bare BP provides a highly non-trivial estimate of the partition function, it is naturally expected, and confirmed in our experimental results, that the proposed algorithm outperforms standard MCMC schemes (not related to BP) applied to the original GM. We believe that our approach provides a new angle on approximate inference in GMs and is of broader interest to various applications involving GMs.

## 2 Preliminaries

### 2.1 Graphical models and belief propagation

Given an undirected graph $G = (V, E)$ with $|V| = n$, $|E| = m$, a pairwise binary Markov Random Field (MRF) defines the following joint probability distribution on $x = [x_v \in \{0,1\} : v \in V]$:

$$p(x) = \frac{1}{Z}\prod_{v \in V} \psi_v(x_v) \prod_{(u,v) \in E} \psi_{u,v}(x_u, x_v), \qquad Z := \sum_{x}\ \prod_{v \in V} \psi_v(x_v) \prod_{(u,v) \in E} \psi_{u,v}(x_u, x_v),$$

where $\psi_v, \psi_{u,v}$ are some non-negative functions, called compatibility or factor functions, and the normalizing constant $Z$ is called the partition function. Without loss of generality, we assume $G$ is connected. It is known that approximating the partition function is #P-hard in general [8].

Belief Propagation (BP) is a popular message-passing heuristic for approximating marginal distributions of an MRF. The BP algorithm iterates the following message updates for all $(u,v) \in E$:

$$m^{t+1}_{u \to v}(x_v) \propto \sum_{x_u \in \{0,1\}} \psi_{u,v}(x_u, x_v)\,\psi_u(x_u) \prod_{w \in N(u) \setminus v} m^t_{w \to u}(x_u),$$

where $N(v)$ denotes the set of neighbors of $v$. In general, BP may fail to converge; in this case one may substitute it with a somewhat more involved algorithm provably convergent to its fixed point [22, 23, 24]. Estimates of the marginal probabilities are expressed via the fixed-point messages $\{m_{u \to v} : (u,v) \in E\}$ as follows:

$$\tau_v(x_v) \propto \psi_v(x_v) \prod_{u \in N(v)} m_{u \to v}(x_v)$$

and

$$\tau_{u,v}(x_u, x_v) \propto \psi_u(x_u)\,\psi_v(x_v)\,\psi_{u,v}(x_u, x_v) \prod_{w \in N(u)\setminus v} m_{w \to u}(x_u) \prod_{w \in N(v)\setminus u} m_{w \to v}(x_v).$$

### 2.2 Bethe approximation and loop calculus

The BP marginals also result in the following Bethe approximation for the partition function $Z$:

$$\log Z_{\text{Bethe}} = \sum_{v}\sum_{x_v} \tau_v(x_v) \log \psi_v(x_v) + \sum_{(u,v)\in E}\sum_{x_u, x_v} \tau_{u,v}(x_u, x_v)\log\psi_{u,v}(x_u,x_v) - \sum_{v}\sum_{x_v} \tau_v(x_v)\log\tau_v(x_v) - \sum_{(u,v)\in E}\sum_{x_u,x_v} \tau_{u,v}(x_u,x_v)\log\frac{\tau_{u,v}(x_u,x_v)}{\tau_u(x_u)\tau_v(x_v)}.$$

If the graph $G$ is a tree, the Bethe approximation is exact, i.e., $Z_{\text{Bethe}} = Z$. In general, however, i.e., for graphs with cycles, the BP algorithm provides an often rather accurate, but still approximate, answer. The Loop Series (LS) [11] expresses the ratio $Z / Z_{\text{Bethe}}$ as the following sum/series:

$$\frac{Z}{Z_{\text{Bethe}}} = Z_{\text{Loop}} := \sum_{F \in \mathcal{L}} w(F), \qquad w(\emptyset) = 1,$$

$$w(F) := \prod_{(u,v) \in E_F} \frac{\tau_{u,v}(1,1) - \tau_u(1)\tau_v(1)}{\tau_u(1)\big(1-\tau_u(1)\big)\,\tau_v(1)\big(1-\tau_v(1)\big)} \prod_{v \in V_F} \tau_v(1)\big(1-\tau_v(1)\big)\Big[\big(1-\tau_v(1)\big)^{d_F(v)-1} + (-1)^{d_F(v)}\,\tau_v(1)^{d_F(v)-1}\Big],$$

where each term/weight is associated with the so-called generalized loop $F$, and $\mathcal{L}$ denotes the set of all generalized loops in the graph $G$ (including the empty subgraph $\emptyset$). Here, a subgraph $F$ of $G$ is called a generalized loop if every vertex $v \in F$ has degree $d_F(v)$ (in the subgraph) no smaller than 2. Since the number of generalized loops is exponentially large, computing $Z_{\text{Loop}}$ is intractable in general. However, the following truncated sum of $Z_{\text{Loop}}$, called the 2-regular loop series, is known to be computable in polynomial time if $G$ is planar [12]:

$$Z_{\text{2-Loop}} := \sum_{F \in \mathcal{L}_{\text{2-Loop}}} w(F),$$

where $\mathcal{L}_{\text{2-Loop}}$ denotes the set of all 2-regular generalized loops, i.e., $F \in \mathcal{L}_{\text{2-Loop}}$ if $d_F(v) = 2$ for every vertex $v$ of $F$. (Note that the number of 2-regular loops is still exponentially large in general.) One can check that $Z_{\text{Loop}} = Z_{\text{2-Loop}}$ for the Ising model without external fields. Furthermore, as stated in [12, 13] for the general case, $Z_{\text{2-Loop}}$ provides a good empirical estimate of $Z_{\text{Loop}}$.
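To make the preliminaries concrete, the following is a minimal sketch, in Python, of synchronous BP updates and the resulting Bethe estimate $\log Z_{\text{Bethe}}$ under the notation of Section 2. It is not the authors' implementation; the function names are our own illustrative choices, and strictly positive factors are assumed so that the logarithms are well defined.

```python
# Minimal sketch of synchronous BP and the Bethe free energy for a pairwise
# binary MRF (Section 2). Assumes strictly positive factors; names are ours.
import numpy as np

def run_bp(n, edges, psi_v, psi_e, iters=200, tol=1e-10):
    """psi_v[v]: length-2 array; psi_e[(u, v)]: 2x2 array indexed [x_u, x_v].
    Returns normalized messages m[(u, v)] = m_{u->v}, adjacency, and pair factors."""
    nbrs = {v: [] for v in range(n)}
    for u, v in edges:
        nbrs[u].append(v); nbrs[v].append(u)
    pair = {**psi_e, **{(v, u): psi_e[(u, v)].T for (u, v) in psi_e}}
    m = {(u, v): np.full(2, 0.5) for u in range(n) for v in nbrs[u]}
    for _ in range(iters):
        new = {}
        for (u, v) in m:
            # m_{u->v}(x_v) ∝ sum_{x_u} psi_uv(x_u,x_v) psi_u(x_u)
            #                          prod_{w in N(u)\v} m_{w->u}(x_u)
            pre = psi_v[u].copy()
            for w in nbrs[u]:
                if w != v:
                    pre *= m[(w, u)]
            msg = pair[(u, v)].T @ pre
            new[(u, v)] = msg / msg.sum()
        done = max(np.abs(new[k] - m[k]).max() for k in m) < tol
        m = new
        if done:
            break
    return m, nbrs, pair

def bethe_log_z(n, edges, psi_v, psi_e, m, nbrs):
    """Bethe estimate of log Z from (approximate) fixed-point messages."""
    tau_v, logz = {}, 0.0
    for v in range(n):
        b = psi_v[v].copy()
        for u in nbrs[v]:
            b *= m[(u, v)]
        tau_v[v] = b / b.sum()
        logz += tau_v[v] @ (np.log(psi_v[v]) - np.log(tau_v[v]))
    for (u, v) in edges:
        pre_u = psi_v[u].copy()
        for w in nbrs[u]:
            if w != v:
                pre_u *= m[(w, u)]
        pre_v = psi_v[v].copy()
        for w in nbrs[v]:
            if w != u:
                pre_v *= m[(w, v)]
        t = psi_e[(u, v)] * np.outer(pre_u, pre_v)
        t /= t.sum()  # pairwise belief tau_{u,v}
        logz += (t * (np.log(psi_e[(u, v)])
                      - np.log(t / np.outer(tau_v[u], tau_v[v])))).sum()
    return logz
```

The singleton and pairwise beliefs returned this way are exactly the $\tau_v$, $\tau_{u,v}$ entering the loop-series weights below.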
## 3 Estimating the 2-regular loop series via MCMC

In this section, we describe how the 2-regular loop series $Z_{\text{2-Loop}}$ can be estimated in polynomial time. To this end, we first assume that the maximum degree of the graph $G$ is at most 3. This degree-constrained assumption is not really restrictive, since any pairwise binary model can easily be expressed as an equivalent one with maximum degree at most 3, e.g., see the supplementary material. The rest of this section consists of two parts. We first propose an algorithm generating a 2-regular loop sample with probability proportional to the absolute value of its weight, i.e.,

$$\pi_{\text{2-Loop}}(F) := \frac{|w(F)|}{\bar{Z}_{\text{2-Loop}}}, \qquad \text{where } \bar{Z}_{\text{2-Loop}} = \sum_{F \in \mathcal{L}_{\text{2-Loop}}} |w(F)|.$$

Note that this 2-regular loop weight allows the following factorization: for any $F \in \mathcal{L}_{\text{2-Loop}}$,

$$w(F) = \prod_{e \in F} w(e), \qquad \text{where, for } e = (u,v), \quad w(e) := \frac{\tau_{u,v}(1,1) - \tau_u(1)\tau_v(1)}{\sqrt{\tau_u(1)\tau_v(1)\big(1-\tau_u(1)\big)\big(1-\tau_v(1)\big)}}. \qquad (1)$$

In the second part, we use the sampler constructed in the first part to design a simulated annealing scheme to estimate $Z_{\text{2-Loop}}$.

### 3.1 Sampling 2-regular loops

We suggest sampling the 2-regular loops distributed according to $\pi_{\text{2-Loop}}$ through a version of the Worm algorithm proposed by Prokof'ev and Svistunov [15]. It can be viewed as an MC exploring the set $\mathcal{L}_{\text{2-Loop}} \cup \mathcal{L}_{\text{2-Odd}}$, where $\mathcal{L}_{\text{2-Odd}}$ is the set of all subgraphs of $G$ with exactly two odd-degree vertices. Given the current state $F \in \mathcal{L}_{\text{2-Loop}} \cup \mathcal{L}_{\text{2-Odd}}$, the next state $F'$ is chosen as follows:

1. If $F \notin \mathcal{L}_{\text{2-Odd}}$, pick a random vertex $v$ (uniformly) from $V$. Otherwise, pick a random odd-degree vertex $v$ (uniformly) from $F$.
2. Choose a random neighbor $u$ of $v$ (uniformly) within $G$, and set $F' \leftarrow F$ initially.
3. Update $F' \leftarrow F \oplus \{u,v\}$ with probability

$$\begin{cases} \min\Big\{\dfrac{1}{n}\,\dfrac{|w(F \oplus \{u,v\})|}{|w(F)|},\ 1\Big\} & \text{if } F \in \mathcal{L}_{\text{2-Loop}}, \\[6pt] \min\Big\{\dfrac{n}{4}\,\dfrac{|w(F \oplus \{u,v\})|}{|w(F)|},\ 1\Big\} & \text{else if } F \oplus \{u,v\} \in \mathcal{L}_{\text{2-Loop}}, \\[6pt] \min\Big\{\dfrac{d(v)}{2d(u)}\,\dfrac{|w(F \oplus \{u,v\})|}{|w(F)|},\ 1\Big\} & \text{else if } F,\ F \oplus \{u,v\} \in \mathcal{L}_{\text{2-Odd}}. \end{cases}$$

Here, $\oplus$ denotes the symmetric difference, and for $F \in \mathcal{L}_{\text{2-Odd}}$ its weight is defined as $w(F) = \prod_{e \in F} w(e)$. In essence, the Worm algorithm either deletes or adds an edge to the current subgraph $F$. From the Worm algorithm, we obtain the following algorithm, which samples 2-regular loops with probability $\pi_{\text{2-Loop}}$ simply by rejecting $F$ whenever $F \in \mathcal{L}_{\text{2-Odd}}$.

**Algorithm 1** Sampling 2-regular loops

1: **Input:** number of trials $N$; number of iterations $T$ of the Worm algorithm
2: **Output:** 2-regular loop $F$
3: **for** $i = 1 \to N$ **do**
4: &nbsp;&nbsp;Set $F \leftarrow \emptyset$ and update it $T$ times by running the Worm algorithm
5: &nbsp;&nbsp;**if** $F$ is a 2-regular loop **then**
6: &nbsp;&nbsp;&nbsp;&nbsp;BREAK and output $F$
7: &nbsp;&nbsp;**end if**
8: **end for**
9: Output $F = \emptyset$.

The following theorem states that Algorithm 1 can generate a desired random sample in polynomial time.

**Theorem 1.** Given $\delta > 0$, choose inputs of Algorithm 1 as

$$N \geq 1.2\,n\log(3\delta^{-1}), \qquad T \geq (m-n+1)\log 2 + 4mn^4\log(3n\delta^{-1}).$$

Then it follows that

$$\frac{1}{2}\sum_{F}\Big|\,\mathbb{P}\big[\text{Algorithm 1 outputs } F\big] - \pi_{\text{2-Loop}}(F)\,\Big| \leq \delta,$$

namely, the total variation distance between $\pi_{\text{2-Loop}}$ and the output distribution of Algorithm 1 is at most $\delta$.

The proof of the above theorem is presented in the supplementary material due to the space constraint. In the proof, we first show that the MC induced by the Worm algorithm mixes in polynomial time, and then prove that acceptance of a 2-regular loop, i.e., line 6 of Algorithm 1, occurs with high probability. Notice that the uniform-weight version of the former claim, i.e., fast mixing, was recently proven in [18]; for completeness of exposition, we present a proof of the general case of interest to us. The latter claim, i.e., high acceptance, requires bounding $|\mathcal{L}_{\text{2-Loop}}|$ and $|\mathcal{L}_{\text{2-Odd}}|$ to show that the probability of sampling a 2-regular loop under the Worm algorithm is $1/\mathrm{poly}(n)$ for some polynomial function $\mathrm{poly}(n)$.
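For concreteness, here is a minimal sketch of the Worm-style chain and the rejection loop of Algorithm 1, using the acceptance probabilities as transcribed above. It assumes the max-degree-3 setting of this section, under which a subgraph with no odd-degree vertex is automatically 2-regular; it is not the authors' code, and the helper names are our own.

```python
# Sketch of the Worm-style chain (Section 3.1) and Algorithm 1's rejection
# loop. F is a set of frozenset edges; edge_w[e] = |w(e)| from Eq. (1).
import random

def worm_step(F, V, nbrs, edge_w):
    """One transition of the chain on L_2-Loop ∪ L_2-Odd."""
    n = len(V)
    odd = [v for v in V
           if sum(1 for u in nbrs[v] if frozenset((u, v)) in F) % 2 == 1]
    v = random.choice(odd) if odd else random.choice(V)
    u = random.choice(nbrs[v])
    e = frozenset((u, v))
    ratio = (1.0 / edge_w[e]) if e in F else edge_w[e]  # |w(F⊕e)| / |w(F)|
    odd_after = len(set(odd) ^ {u, v})                  # parities of u, v flip
    if not odd:                                         # F ∈ L_2-Loop
        p = min(ratio / n, 1.0)
    elif odd_after == 0:                                # F⊕e ∈ L_2-Loop
        p = min(n * ratio / 4.0, 1.0)
    else:                                               # both in L_2-Odd
        p = min(len(nbrs[v]) * ratio / (2.0 * len(nbrs[u])), 1.0)
    return (F ^ {e}) if random.random() < p else F

def sample_2regular_loop(V, nbrs, edge_w, N, T):
    """Algorithm 1: restart up to N times, rejecting runs ending in L_2-Odd."""
    for _ in range(N):
        F = set()
        for _ in range(T):
            F = worm_step(F, V, nbrs, edge_w)
        if all(sum(1 for u in nbrs[v] if frozenset((u, v)) in F) % 2 == 0
               for v in V):
            return F                                    # 2-regular loop
    return set()
```

Recomputing the odd-vertex set each step keeps the sketch short; an efficient implementation would maintain subgraph degrees incrementally.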
### 3.2 Simulated annealing for approximating the 2-regular loop series

Here we utilize Theorem 1 to describe an algorithm approximating the 2-regular LS $Z_{\text{2-Loop}}$ in polynomial time. To achieve this goal, we rely on the simulated annealing strategy [19], which requires choosing a monotone cooling schedule $\beta_0, \beta_1, \ldots, \beta_{\ell-1}, \beta_\ell$, where $\beta_\ell$ corresponds to the target counting problem and $\beta_0$ to a relaxed, easy version of it. Thus, designing an appropriate cooling strategy is the first challenge to address. We also describe how to deal with the issue that $Z_{\text{2-Loop}}$ is a sum of positive and negative terms, while most simulated annealing strategies in the literature were developed for sums of non-negative terms. This second challenge is related to the so-called fermion sign problem common in the statistical mechanics of quantum systems [25].

Before describing the proposed algorithm in detail, let us provide an intuitive sketch. The algorithm consists of two parts: (a) estimating $\bar{Z}_{\text{2-Loop}}$ via a simulated annealing strategy, and (b) estimating $Z_{\text{2-Loop}}/\bar{Z}_{\text{2-Loop}}$ by counting samples corresponding to negative terms in the 2-regular loop series. First, consider the following $\beta$-parametrized auxiliary distribution over 2-regular loops:

$$\pi_{\text{2-Loop}}(F : \beta) = \frac{|w(F)|^\beta}{\bar{Z}_{\text{2-Loop}}(\beta)}, \qquad \text{for } 0 \leq \beta \leq 1. \qquad (2)$$

Note that one can generate samples approximately distributed according to (2) in polynomial time using Algorithm 1 by setting $w \leftarrow w^\beta$. Indeed, it follows that for $\beta' > \beta$,

$$\frac{\bar{Z}_{\text{2-Loop}}(\beta')}{\bar{Z}_{\text{2-Loop}}(\beta)} = \sum_{F \in \mathcal{L}_{\text{2-Loop}}} |w(F)|^{\beta'-\beta}\,\frac{|w(F)|^{\beta}}{\bar{Z}_{\text{2-Loop}}(\beta)} = \mathbb{E}_{\pi_{\text{2-Loop}}(\beta)}\Big[|w(F)|^{\beta'-\beta}\Big],$$

where the expectation can be estimated using $O(1)$ samples if it is $\Theta(1)$, i.e., if $\beta'$ is sufficiently close to $\beta$. Then, for any increasing sequence $\beta_0 = 0, \beta_1, \ldots, \beta_{n-1}, \beta_n = 1$, we derive

$$\bar{Z}_{\text{2-Loop}} = \frac{\bar{Z}_{\text{2-Loop}}(\beta_n)}{\bar{Z}_{\text{2-Loop}}(\beta_{n-1})}\cdot\frac{\bar{Z}_{\text{2-Loop}}(\beta_{n-1})}{\bar{Z}_{\text{2-Loop}}(\beta_{n-2})}\cdots\frac{\bar{Z}_{\text{2-Loop}}(\beta_2)}{\bar{Z}_{\text{2-Loop}}(\beta_1)}\cdot\frac{\bar{Z}_{\text{2-Loop}}(\beta_1)}{\bar{Z}_{\text{2-Loop}}(\beta_0)}\cdot\bar{Z}_{\text{2-Loop}}(0),$$

where it is known that $\bar{Z}_{\text{2-Loop}}(0)$, i.e., the total number of 2-regular loops, is exactly $2^{m-n+1}$ [16]. This allows us to estimate $\bar{Z}_{\text{2-Loop}}$ simply by estimating $\mathbb{E}_{\pi_{\text{2-Loop}}(\beta_i)}\big[|w(F)|^{\beta_{i+1}-\beta_i}\big]$ for all $i$. Our next step is to estimate the ratio $Z_{\text{2-Loop}}/\bar{Z}_{\text{2-Loop}}$. Let $\mathcal{L}^{-}_{\text{2-Loop}}$ denote the set of negative 2-regular loops, i.e., $\mathcal{L}^{-}_{\text{2-Loop}} := \{F : F \in \mathcal{L}_{\text{2-Loop}},\ w(F) < 0\}$. Then the 2-regular loop series can be expressed as

$$Z_{\text{2-Loop}} = \bar{Z}_{\text{2-Loop}} - 2\sum_{F \in \mathcal{L}^{-}_{\text{2-Loop}}} |w(F)| = \Big(1 - 2\,\mathbb{P}_{\pi_{\text{2-Loop}}}\big[w(F) < 0\big]\Big)\,\bar{Z}_{\text{2-Loop}},$$

where we estimate $\mathbb{P}_{\pi_{\text{2-Loop}}}[w(F) < 0]$ again using samples generated by Algorithm 1. We provide the formal description of the proposed algorithm and its error bound as follows.

**Algorithm 2** Approximation for $Z_{\text{2-Loop}}$

1: **Input:** increasing sequence $\beta_0 = 0 < \beta_1 < \cdots < \beta_{n-1} < \beta_n = 1$; numbers of samples $s_1, s_2$; numbers of trials $N_1, N_2$ and numbers of iterations $T_1, T_2$ for Algorithm 1
2: **for** $i = 0 \to n-1$ **do**
3: &nbsp;&nbsp;Generate 2-regular loops $F_1, \ldots, F_{s_1}$ for $\pi_{\text{2-Loop}}(\beta_i)$ using Algorithm 1 with inputs $N_1$ and $T_1$, and set $H_i \leftarrow \frac{1}{s_1}\sum_j |w(F_j)|^{\beta_{i+1}-\beta_i}$
4: **end for**
5: Generate 2-regular loops $F_1, \ldots, F_{s_2}$ for $\pi_{\text{2-Loop}}$ using Algorithm 1 with inputs $N_2$ and $T_2$, and set $\kappa \leftarrow |\{F_j : w(F_j) < 0\}|/s_2$
6: **Output:** $\widehat{Z}_{\text{2-Loop}} \leftarrow (1 - 2\kappa)\,2^{m-n+1}\prod_i H_i$

**Theorem 2.** Given $\varepsilon, \nu > 0$, choose inputs of Algorithm 2 as $\beta_i = i/n$ for $i = 1, 2, \ldots, n-1$ and

$$s_1 \geq 18144\,n^2\varepsilon^{-2}w_{\min}^{-1}\log(6n\nu^{-1}), \quad N_1 \geq 1.2\,n\log(144n\varepsilon^{-1}w_{\min}^{-1}), \quad T_1 \geq (m-n+1)\log 2 + 4mn^4\log(48n\varepsilon^{-1}w_{\min}^{-1}),$$
$$s_2 \geq 18144\,\zeta(1-2\zeta)^{-2}\varepsilon^{-2}\log(3\nu^{-1}), \quad N_2 \geq 1.2\,n\log(144\varepsilon^{-1}(1-2\zeta)^{-1}), \quad T_2 \geq (m-n+1)\log 2 + 4mn^4\log(48\varepsilon^{-1}(1-2\zeta)^{-1}),$$

where $w_{\min} = \min_{e \in E} w(e)$ and $\zeta = \mathbb{P}_{\pi_{\text{2-Loop}}}[w(F) < 0]$. Then the following statement holds:

$$\mathbb{P}\Big[\big|\widehat{Z}_{\text{2-Loop}} - Z_{\text{2-Loop}}\big| \leq \varepsilon\,Z_{\text{2-Loop}}\Big] \geq 1 - \nu,$$

which means that Algorithm 2 estimates $Z_{\text{2-Loop}}$ within approximation ratio $1 \pm \varepsilon$ with high probability.

The proof of the above theorem is presented in the supplementary material due to the space constraint. We note that the constants entering Theorem 2 were not optimized. Theorem 2 implies that the complexity of Algorithm 2 is polynomial with respect to $n$, $1/\varepsilon$, $1/\nu$ under the assumption that $w_{\min}^{-1}$ and $\big(1 - 2\mathbb{P}_{\pi_{\text{2-Loop}}}[w(F) < 0]\big)^{-1}$ are polynomially bounded. Both quantities depend on the choice of BP fixed point; however, it is unlikely (barring degeneracies) that they become large. In particular, $\mathbb{P}_{\pi_{\text{2-Loop}}}[w(F) < 0] = 0$ in the case of attractive models [20].
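The telescoping-product structure of Algorithm 2 can be summarized in a few lines. Below is a sketch assuming a black-box sampler `sample(beta)` for $\pi_{\text{2-Loop}}(\cdot : \beta)$, e.g., the worm sampler sketched above run with edge weights $|w(e)|^\beta$; all names here are our own, not the authors'.

```python
# Sketch of Algorithm 2's estimator. sample(beta) returns one 2-regular loop
# drawn approximately from pi_2-Loop(. : beta); loop_weight(F) returns the
# signed weight w(F) = prod_{e in F} w(e) (equal to 1 for the empty loop).
import math

def algorithm2_sketch(sample, loop_weight, betas, s1, s2, m, n):
    log_prod = 0.0
    for b, b_next in zip(betas[:-1], betas[1:]):
        # H_i estimates Zbar(b_next)/Zbar(b) = E_{pi(b)}[|w(F)|^(b_next - b)]
        h = sum(abs(loop_weight(sample(b))) ** (b_next - b)
                for _ in range(s1)) / s1
        log_prod += math.log(h)
    # kappa estimates P_pi[w(F) < 0]; the signed series is (1 - 2 kappa) Zbar
    kappa = sum(loop_weight(sample(1.0)) < 0 for _ in range(s2)) / s2
    # Zbar(0) = 2^(m - n + 1) counts all 2-regular loops of a connected graph
    return (1 - 2 * kappa) * 2 ** (m - n + 1) * math.exp(log_prod)
```

Accumulating the ratios in log space, as above, avoids overflow or underflow when the chain of ratios is long.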
## 4 Estimating the full loop series via MCMC

In this section, we aim to estimate the full loop series $Z_{\text{Loop}}$. To this end, we design a novel MC sampler for generalized loops, which adds to (or removes from) the current generalized loop an element of a cycle basis or a path. We therefore naturally start this section by introducing the necessary background on cycle bases, then describe the design of the MC sampler for generalized loops, and finally describe a simulated annealing scheme similar to the one of the preceding section. We also report its experimental performance in comparison with other methods.

### 4.1 Sampling generalized loops with a cycle basis

A cycle basis $\mathcal{C}$ of the graph $G$ is a minimal set of cycles which allows representing every Eulerian subgraph of $G$ (i.e., every subgraph containing no odd-degree vertex) as a symmetric difference of cycles in the set [16]. Let us characterize the combinatorial structure of generalized loops using a cycle basis. To this end, consider a set of paths between all pairs of vertices:

$$\mathcal{P} = \{P_{u,v} : u \neq v,\ u, v \in V,\ P_{u,v} \text{ is a path from } u \text{ to } v\}, \qquad \text{i.e., } |\mathcal{P}| = \binom{n}{2}.$$

Then the following theorem allows decomposing any generalized loop with respect to any selected $\mathcal{C}$ and $\mathcal{P}$.

**Theorem 3.** Consider any cycle basis $\mathcal{C}$ and path set $\mathcal{P}$. Then, for any generalized loop $F$, there exists a decomposition $\mathcal{B} \subseteq \mathcal{C} \cup \mathcal{P}$ such that $F$ can be expressed as a symmetric difference of the elements of $\mathcal{B}$, i.e., $F = B_1 \oplus B_2 \oplus \cdots \oplus B_{k-1} \oplus B_k$ for some $B_i \in \mathcal{B}$.

The proof of the above theorem is given in the supplementary material due to the space constraint. Now, given any choice of $\mathcal{C}$ and $\mathcal{P}$, consider the following transition from $F \in \mathcal{L}$ to the next state $F'$:

1. Choose, uniformly at random, an element $B \in \mathcal{C} \cup \mathcal{P}$, and set $F' \leftarrow F$ initially.
2. If $F \oplus B \in \mathcal{L}$, update

$$F' \leftarrow \begin{cases} F \oplus B & \text{with probability } \min\Big\{1,\ \dfrac{|w(F \oplus B)|}{|w(F)|}\Big\}, \\[4pt] F & \text{otherwise.} \end{cases}$$

Due to Theorem 3, it is easy to check that the proposed MC is irreducible and aperiodic, i.e., ergodic, and the distribution of its $t$-th state converges, as $t \to \infty$, to the stationary distribution

$$\pi_{\text{Loop}}(F) = \frac{|w(F)|}{\bar{Z}_{\text{Loop}}}, \qquad \text{where } \bar{Z}_{\text{Loop}} = \sum_{F \in \mathcal{L}} |w(F)|.$$

One also has freedom in choosing $\mathcal{C}$ and $\mathcal{P}$. To accelerate mixing of the MC, we suggest choosing the minimum weighted cycle basis $\mathcal{C}$ and the shortest paths $\mathcal{P}$ with respect to the edge weights $\{\log w(e)\}$ defined in (1), which are computable using the algorithm in [16] and the Bellman-Ford algorithm [21], respectively. This encourages transitions between generalized loops with similar weights.
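The following sketch illustrates the chain of Section 4.1. For simplicity it builds a fundamental cycle basis from a BFS spanning tree and uses tree paths for $\mathcal{P}$, rather than the minimum weighted cycle basis and shortest paths recommended above; any such choice keeps the chain ergodic by Theorem 3. The code is ours, not the paper's implementation.

```python
# Sketch of the rejection-free chain of Section 4.1 over generalized loops.
# Subgraphs are frozensets of frozenset edges; the graph is assumed connected.
import random
from collections import deque
from itertools import combinations

def basis_and_paths(n, edges):
    """Fundamental cycle basis of a BFS spanning tree, plus tree paths P."""
    adj = {v: [] for v in range(n)}
    for u, v in edges:
        adj[u].append(v); adj[v].append(u)
    par, queue = {0: None}, deque([0])
    while queue:                       # BFS spanning tree rooted at 0
        x = queue.popleft()
        for y in adj[x]:
            if y not in par:
                par[y] = x; queue.append(y)
    def root_path(v):                  # edge set from v up to the root
        es = set()
        while par[v] is not None:
            es.add(frozenset((v, par[v]))); v = par[v]
        return es
    tree = {frozenset((v, p)) for v, p in par.items() if p is not None}
    basis = [root_path(u) ^ root_path(v) | {frozenset((u, v))}
             for u, v in edges if frozenset((u, v)) not in tree]
    paths = [root_path(u) ^ root_path(v)
             for u, v in combinations(range(n), 2)]
    return [frozenset(B) for B in basis + paths]

def is_generalized_loop(F):
    """Every vertex touched by F must have degree >= 2 (empty F is allowed)."""
    deg = {}
    for e in F:
        for v in e:
            deg[v] = deg.get(v, 0) + 1
    return all(d >= 2 for d in deg.values())

def loop_mc_step(F, blocks, abs_w):
    """One Metropolis move F -> F ⊕ B; abs_w(F) = prod_{e in F} |w(e)|,
    with abs_w of the empty loop equal to 1."""
    B = random.choice(blocks)
    Fp = frozenset(F ^ B)
    if is_generalized_loop(Fp) and \
       random.random() < min(1.0, abs_w(Fp) / abs_w(F)):
        return Fp
    return F
```

Starting the chain at `F = frozenset()` (the empty loop, which belongs to $\mathcal{L}$) and iterating `loop_mc_step` yields samples whose distribution converges to $\pi_{\text{Loop}}$.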
### 4.2 Simulated annealing for approximating the full loop series

Now we are ready to describe a simulated annealing scheme for estimating $Z_{\text{Loop}}$; it is similar, in principle, to that of Section 3.2. First, we again introduce a $\beta$-parametrized auxiliary probability distribution $\pi_{\text{Loop}}(F : \beta) = |w(F)|^\beta / \bar{Z}_{\text{Loop}}(\beta)$. For any decreasing sequence of annealing parameters $\beta_0 > \beta_1 > \cdots > \beta_{\ell-1} > \beta_\ell = 1$, we derive

$$\bar{Z}_{\text{Loop}} = \frac{\bar{Z}_{\text{Loop}}(\beta_\ell)}{\bar{Z}_{\text{Loop}}(\beta_{\ell-1})}\cdot\frac{\bar{Z}_{\text{Loop}}(\beta_{\ell-1})}{\bar{Z}_{\text{Loop}}(\beta_{\ell-2})}\cdots\frac{\bar{Z}_{\text{Loop}}(\beta_2)}{\bar{Z}_{\text{Loop}}(\beta_1)}\cdot\frac{\bar{Z}_{\text{Loop}}(\beta_1)}{\bar{Z}_{\text{Loop}}(\beta_0)}\cdot\bar{Z}_{\text{Loop}}(\beta_0).$$

Following a procedure similar to that of Section 3.2, one can estimate $\bar{Z}_{\text{Loop}}(\beta')/\bar{Z}_{\text{Loop}}(\beta) = \mathbb{E}_{\pi_{\text{Loop}}(\beta)}\big[|w(F)|^{\beta'-\beta}\big]$ using the sampler described in Section 4.1. Moreover, $\bar{Z}_{\text{Loop}}(\beta_0) = |w(F^*)|^{\beta_0}/\pi_{\text{Loop}}(F^* : \beta_0)$ is estimated by sampling the generalized loop $F^*$ with the highest probability $\pi_{\text{Loop}}(F^* : \beta_0)$. For large enough $\beta_0$, the approximation error becomes relatively small, since $\pi_{\text{Loop}}(F^* : \beta_0) \propto |w(F^*)|^{\beta_0}$ dominates the distribution. In combination, this provides the desired approximation for $Z_{\text{Loop}}$. The resulting procedure is stated formally in Algorithm 3.

**Algorithm 3** Approximation for $Z_{\text{Loop}}$

1: **Input:** decreasing sequence $\beta_0 > \beta_1 > \cdots > \beta_{\ell-1} > \beta_\ell = 1$; numbers of samples $s_0, s_1, s_2$; numbers of iterations $T_0, T_1, T_2$ for the MC described in Section 4.1
2: Generate generalized loops $F_1, \ldots, F_{s_0}$ by running $T_0$ iterations of the MC described in Section 4.1 for $\pi_{\text{Loop}}(\beta_0)$, and set $H \leftarrow \frac{s_0}{s^*}|w(F^*)|^{\beta_0}$, where $F^* = \arg\max_{F \in \{F_1, \ldots, F_{s_0}\}} |w(F)|$ and $s^*$ is the number of times $F^*$ was sampled
3: **for** $i = 0 \to \ell-1$ **do**
4: &nbsp;&nbsp;Generate generalized loops $F_1, \ldots, F_{s_1}$ by running $T_1$ iterations of the MC described in Section 4.1 for $\pi_{\text{Loop}}(\beta_i)$, and set $H_i \leftarrow \frac{1}{s_1}\sum_j |w(F_j)|^{\beta_{i+1}-\beta_i}$
5: **end for**
6: Generate generalized loops $F_1, \ldots, F_{s_2}$ by running $T_2$ iterations of the MC described in Section 4.1 for $\pi_{\text{Loop}}$, and set $\kappa \leftarrow |\{F_j : w(F_j) < 0\}|/s_2$
7: **Output:** $\widehat{Z}_{\text{Loop}} \leftarrow (1 - 2\kappa)\,H\prod_i H_i$

*Figure 1: Plots of the log-partition function approximation error with respect to (average) interaction strength: (a) Ising model with no external field, (b) Ising model with external fields, and (c) hard-core model. Each point is averaged over 20 (random) models.*

### 4.3 Experimental results

In this section, we report experimental results for computing the partition function of the Ising model and the hard-core model. We compare Algorithm 2 of Section 3 (coined MCMC-BP-2reg) and Algorithm 3 of Section 4.2 (coined MCMC-BP-whole) with the bare Bethe approximation (coined BP) and the popular Gibbs sampler (coined MCMC-Gibbs). To make the comparison fair, we use the same annealing scheme for all MCMC schemes, thus making their running times comparable. More specifically, we generate each sample after running $T_1 = 1{,}000$ iterations of an MC and take $s_1 = 100$ samples to compute each estimate (e.g., $H_i$) at intermediate steps. As the performance measure, we use the log-partition-function approximation error, defined as $|\log Z - \log Z_{\text{approx}}|/|\log Z|$, where $Z_{\text{approx}}$ is the output of the respective algorithm.

We conducted three experiments on the $4 \times 4$ grid graph. In the first experimental setting, we consider the Ising model with varying interaction strength and no external (magnetic) field. To prepare the model of interest, we start from the Ising model with uniform (ferromagnetic/attractive or anti-ferromagnetic/repulsive) interaction strength and then add glassy variability in the interaction strength, modeled via i.i.d. Gaussian random variables with mean 0 and variance $0.5^2$, i.e., $\mathcal{N}(0, 0.5^2)$. In other words, given average interaction strength $0.3$, each interaction strength in the model is independently drawn from $\mathcal{N}(0.3, 0.5^2)$. The second experiment was conducted by adding $\mathcal{N}(0, 0.5^2)$ corrections to the external fields under the same conditions as in the first experiment. In this case we observe that BP often fails to converge, and we use the Concave Convex Procedure (CCCP) [23] for finding BP fixed points. Finally, we experiment with the hard-core model on the $4 \times 4$ grid graph with a varying positive parameter $\lambda > 0$, called the fugacity [26].
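As a reproducibility aid, here is a sketch of the first experiment's model generation and the exact baseline entering the error metric, as we read the setup; the helper names are our own, and brute-force enumeration of the $2^{16}$ configurations is feasible on the $4 \times 4$ grid.

```python
# Sketch (our reading of Section 4.3): a 4x4 grid Ising model with glassy
# couplings J_uv ~ N(mean_j, 0.5^2), spins x_v in {-1, +1}, and exact log Z
# by enumeration, from which the reported error metric follows.
import itertools
import numpy as np

def grid_ising(size=4, mean_j=0.3, std_j=0.5, seed=0):
    rng = np.random.default_rng(seed)
    edges = []
    for i in range(size):
        for j in range(size):
            v = i * size + j
            if j + 1 < size:
                edges.append((v, v + 1))        # horizontal edge
            if i + 1 < size:
                edges.append((v, v + size))     # vertical edge
    coup = {e: rng.normal(mean_j, std_j) for e in edges}  # glassy couplings
    return size * size, edges, coup

def exact_log_z(n, coup, field=None):
    """Enumerate all 2^n spin configurations; log-sum-exp for stability."""
    energies = []
    for x in itertools.product([-1, 1], repeat=n):
        e = sum(j * x[u] * x[v] for (u, v), j in coup.items())
        if field is not None:
            e += sum(field[v] * x[v] for v in range(n))
        energies.append(e)
    mx = max(energies)
    return mx + np.log(sum(np.exp(e - mx) for e in energies))

def rel_log_error(log_z, log_z_approx):
    """Error metric of Figure 1: |log Z - log Z_approx| / |log Z|."""
    return abs(log_z - log_z_approx) / abs(log_z)
```

The second experiment additionally draws external fields from $\mathcal{N}(0, 0.5^2)$ and passes them via the `field` argument.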
As clearly seen in Figure 1, BP and MCMC-Gibbs are outperformed by MCMC-BP-2reg and MCMC-BP-whole in most tested regimes of the first experiment with no external field, where in this case the 2-regular loop series (LS) is equal to the full one. Even in the regimes where MCMC-Gibbs outperforms BP, our schemes correct the error of BP and perform at least as well as MCMC-Gibbs. In the experiments, we observe that the advantage of our schemes over BP is more pronounced when the error of BP is large. A theoretical reasoning behind this observation is as follows: if the performance of BP is good, i.e., the loop series is close to 1, the contribution of the empty generalized loop, i.e., $w(\emptyset) = 1$, dominates the LS, and it becomes harder to sample the other generalized loops accurately.

## 5 Conclusion

In this paper, we propose new MCMC schemes for approximate inference in GMs. The main novelty of our approach is in designing BP-aware MCs utilizing non-trivial BP solutions. In experiments, our BP-based MCMC schemes outperform other alternatives. We anticipate that this new technique will be of interest to many applications where GMs are used for statistical reasoning.

## Acknowledgement

This work was supported by the National Research Council of Science & Technology (NST) grant by the Korea government (MSIP) (No. CRC-15-05-ETRI), and by funding from the U.S. Department of Energy's Office of Electricity as part of the DOE Grid Modernization Initiative.

## References

[1] J. Pearl, Probabilistic reasoning in intelligent systems: networks of plausible inference, Morgan Kaufmann, 2014.

[2] R. G. Gallager, Low-density parity-check codes, IRE Transactions on Information Theory 8(1): 21-28, 1962.

[3] F. R. Kschischang and B. J. Frey, Iterative decoding of compound codes by probability propagation in graphical models, IEEE Journal on Selected Areas in Communications 16(2): 219-230, 1998.

[4] M. I. Jordan, ed., Learning in graphical models, Springer Science & Business Media 89, 1998.

[5] R. J. Baxter, Exactly solved models in statistical mechanics, Courier Corporation, 2007.

[6] W. T. Freeman, E. C. Pasztor, and O. T. Carmichael, Learning low-level vision, International Journal of Computer Vision 40(1): 25-47, 2000.

[7] V. Chandrasekaran, N. Srebro, and P. Harsha, Complexity of inference in graphical models, Conference on Uncertainty in Artificial Intelligence, 2008.

[8] M. Jerrum and A. Sinclair, Polynomial-time approximation algorithms for the Ising model, SIAM Journal on Computing 22(5): 1087-1116, 1993.

[9] C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan, An introduction to MCMC for machine learning, Machine Learning 50(1-2): 5-43, 2003.

[10] J. Pearl, Reverend Bayes on inference engines: a distributed hierarchical approach, Association for the Advancement of Artificial Intelligence, 1982.

[11] M. Chertkov and V. Y. Chernyak, Loop series for discrete statistical models on graphs, Journal of Statistical Mechanics: Theory and Experiment 2006(6): P06009, 2006.

[12] M. Chertkov, V. Y. Chernyak, and R. Teodorescu, Belief propagation and loop series on planar graphs, Journal of Statistical Mechanics: Theory and Experiment 2008(5): P05003, 2008.

[13] V. Gomez, H. J. Kappen, and M. Chertkov, Approximate inference on planar graphs using loop calculus and belief propagation, Journal of Machine Learning Research 11: 1273-1296, 2010.

[14] P. W. Kasteleyn, The statistics of dimers on a lattice, Classic Papers in Combinatorics, Birkhäuser Boston, 281-298, 2009.

[15] N. Prokof'ev and B. Svistunov, Worm algorithms for classical statistical models, Physical Review Letters 87(16): 160601, 2001.
[16] J. D. Horton, A polynomial-time algorithm to find the shortest cycle basis of a graph, SIAM Journal on Computing 16(2): 358-366, 1987.

[17] H. A. Kramers and G. H. Wannier, Statistics of the two-dimensional ferromagnet. Part II, Physical Review 60(3): 263, 1941.

[18] A. Collevecchio, T. M. Garoni, T. Hyndman, and D. Tokarev, The worm process for the Ising model is rapidly mixing, arXiv preprint arXiv:1509.03201, 2015.

[19] S. Kirkpatrick, Optimization by simulated annealing: quantitative studies, Journal of Statistical Physics 34(5-6): 975-986, 1984.

[20] N. Ruozzi, The Bethe partition function of log-supermodular graphical models, Advances in Neural Information Processing Systems, 2012.

[21] J. Bang-Jensen and G. Z. Gutin, Digraphs: theory, algorithms and applications, Springer Science & Business Media, 2008.

[22] Y. W. Teh and M. Welling, Belief optimization for binary networks: a stable alternative to loopy belief propagation, Proceedings of the Conference on Uncertainty in Artificial Intelligence, 493-500, 2001.

[23] A. L. Yuille, CCCP algorithms to minimize the Bethe and Kikuchi free energies: convergent alternatives to belief propagation, Neural Computation 14(7): 1691-1722, 2002.

[24] J. Shin, The complexity of approximating a Bethe equilibrium, IEEE Transactions on Information Theory 60(7): 3959-3969, 2014.

[25] https://www.quora.com/Statistical-Mechanics-What-is-the-fermion-sign-problem

[26] M. Dyer, A. Frieze, and M. Jerrum, On counting independent sets in sparse graphs, SIAM Journal on Computing 31(5): 1527-1541, 2002.

[27] J. Schweinsberg, An O(n²) bound for the relaxation time of a Markov chain on cladograms, Random Structures & Algorithms 20(1): 59-70, 2002.