# Spectral Approximate Inference

Sejun Park¹ Eunho Yang²³⁴ Se-Young Yun³⁵ Jinwoo Shin¹³⁴

¹School of Electrical Engineering, KAIST, Daejeon, Korea. ²School of Computing, KAIST, Daejeon, Korea. ³Graduate School of AI, KAIST, Daejeon, Korea. ⁴AITRICS, Seoul, Korea. ⁵Department of Industrial & System Engineering, KAIST, Daejeon, Korea. Correspondence to: Jinwoo Shin.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Given a graphical model (GM), computing its partition function is the most essential inference task, but it is computationally intractable in general. To address the issue, iterative approximation algorithms exploring certain local structure/consistency of GM have been investigated as popular choices in practice. However, due to their local/iterative nature, they often output poor approximations or even fail to converge, e.g., in low-temperature regimes (hard instances with large parameters). To overcome this limitation, we propose a novel approach utilizing global spectral features of GM. Our contribution is two-fold: (a) we first propose a fully polynomial-time approximation scheme (FPTAS) for approximating the partition function of a GM associated with a low-rank coupling matrix; (b) for general high-rank GMs, we design a spectral mean-field scheme utilizing (a) as a subroutine, which approximates a high-rank GM by a product of rank-1 GMs for an efficient approximation of the partition function. The proposed algorithm is more robust in its running time and accuracy than prior methods, i.e., it neither suffers from convergence issues nor depends on hard local structures, as demonstrated in our experiments.

1. Introduction

Graphical models (GMs) provide a succinct representation of a joint probability distribution over a set of random variables by encoding their conditional dependencies in graphical structures. GMs have been studied in various fields of machine learning, including computer vision (Freeman et al., 2000), speech recognition (Bilmes, 2004) and deep learning (Salakhutdinov & Larochelle, 2010). Most inference problems arising in GMs, e.g., obtaining desired samples and computing marginal distributions, can be easily reduced to computing their partition function (normalizing constant). However, computing the partition function is #P-hard in general, even to approximate (Jerrum & Sinclair, 1993), which is thus a fundamental barrier for inference tasks of GM.

Variational inference is one of the most popular heuristics in practice for estimating the partition function. It is typically achieved by running iterative local message-passing algorithms, e.g., mean-field approximation (Parisi, 1988; Jain et al., 2018) and belief propagation (Pearl, 1982; Wainwright et al., 2005). Markov chain Monte Carlo (MCMC) methods (Neal, 2001; Efthymiou et al., 2016) are another popular approach: they sample from GMs via Markov chains with local transitions, e.g., the Gibbs sampler (Geman & Geman, 1984), and estimate a target expectation by averaging over samples. Unfortunately, neither variational nor MCMC methods can guarantee convergence/mixing under a fixed computation budget, and both are known to output poor approximations in the low-temperature regime, i.e., for large GM parameters, due to the non-existence of the so-called correlation decay (Weitz, 2006; Bandyopadhyay & Gamarnik, 2008).
On the other hand, variable elimination (Dechter, 1999; Dechter & Rish, 2003; Liu & Ihler, 2011; Xue et al., 2016; Wrigley et al., 2017; Ahn et al., 2018a;b) is one of the popular convergence-free methods for approximating the partition function. At each step, it marginalizes a chosen variable and generates complex high-order factors approximating the marginalized variable and its associated factors; hence, it is guaranteed to terminate after marginalizing all variables. However, the performance of variable elimination schemes is also significantly degraded in the low-temperature regime, due to their local/iterative nature of processing variables one by one.

Contribution. In this paper, we propose a completely new approach investigating the global information of GM, to overcome the limitation of prior methods. To this end, we study the spectral features of the coupling matrix of GM and propose a partition function approximation algorithm utilizing its eigenvectors and eigenvalues. In particular, if the matrix rank and parameters of GM are bounded, i.e., O(1), then we prove that the proposed algorithm is a fully polynomial-time approximation scheme (FPTAS), even for GMs with high treewidth. Such polynomial-time approximation schemes have typically been investigated in the literature for certain structured GMs (Temperley & Fisher, 1961; Pearl, 1982; Dechter, 1999; Jerrum et al., 2004), for high-temperature regimes (Zhang et al., 2011; Li et al., 2013; Patel & Regts, 2017) or under homogeneity of GM parameters (Jerrum & Sinclair, 1993; Sinclair et al., 2014; Molkaraie, 2016; Liu et al., 2017; Patel & Regts, 2017; Molkaraie & Gómez, 2018). Our theoretical result provides a new class of GMs for this direction.

Despite the theoretical value of the proposed algorithm for low-rank GMs, it is very expensive to run for general high-rank GMs, as its complexity grows exponentially with respect to the rank. To address this issue, we decompose the partition function of a high-rank GM into a product of those of rank-1 GMs. Then, we run the proposed FPTAS to compute all rank-1 partition functions and combine them to approximate the original partition function. To further improve the approximation, we additionally suggest solving a semi-definite program to discover a better spectral decomposition of the partition function. In a sense, our approach is of mean-field type, but different from traditional ones that decompose the GM itself without spectral pre-processing. We present an illustration of the proposed scheme in Figure 1.

[Figure 1. An illustration of the proposed partition function approximation scheme.]

The proposed mean-field scheme can be applied universally to any GM without rank restrictions. Its computational complexity scales well for large GMs without suffering from convergence issues. Furthermore, its approximation quality is quite robust against hard GM instances with heterogeneous parameters, since the utilized spectral feature grows linearly with respect to the inverse temperature, i.e., the scale of parameters. Our experiments demonstrate that the proposed scheme indeed outperforms mean-field approximation, belief propagation and variable elimination, particularly and significantly in the low-temperature regimes where the prior methods fail.

2. Spectral Inference for Low-Rank GMs

We begin by introducing the definition of the pairwise binary graphical model (GM).
Given a vector $\theta \in \mathbb{R}^n$ and a symmetric matrix $A \in \mathbb{R}^{n \times n}$, we define a GM as the following joint distribution on $x \in \Omega := \{-1, 1\}^n$:

$$p(x) = \frac{1}{Z} \exp\left(\langle \theta, x \rangle + x^T A x\right), \qquad (1)$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product and $Z$ is the normalizing constant. The above definition of GM coincides with the following conventional definition associated with an undirected graph $G = (V, E)$, with potential

$$\sum_{i \in V} \theta_i x_i + 2 \sum_{(i,j) \in E} A_{ij} x_i x_j, \qquad (2)$$

where $V = \{1, \dots, n\}$ and $E = \{(i,j) : A_{ij} \neq 0,\ i < j\}$. The normalizing constant $Z$ of (1) is called the partition function, defined as

$$Z = Z(\theta, A) := \sum_{x \in \Omega} \exp\left(\langle \theta, x \rangle + x^T A x\right). \qquad (3)$$

Computing $Z$ is one of the most essential inference tasks arising in GMs. However, it is known to be computationally intractable in general, i.e., #P-hard even to approximate (Jerrum & Sinclair, 1993). In particular, the case when the magnitudes of the entries of $A$ are large is called the low-temperature regime (Sykes et al., 1965), where $Z$ is provably even harder to approximate (Sly & Sun, 2012; Galanis et al., 2014). This is indeed the regime where known heuristics fail badly.

In this section, we show that $Z$ can be approximated in polynomial time if there exists a diagonal matrix $D$ such that the rank of $A + D$ is bounded, i.e., $O(1)$. For clarity, we primarily focus on the case when $A$ is of low rank itself (i.e., $D = 0$) and describe at the end of this section how our results extend to the case when $A + D$ is of low rank for some diagonal matrix $D$.

2.1. Overall Approach: Approximate Inference via Spectral Decomposition

To design such a polynomial-time algorithm, we first reformulate $Z$ using the eigenvalues/eigenvectors of $A$ as follows:

$$Z = \sum_{x \in \Omega} \exp\left(\langle \theta, x \rangle + x^T A x\right) = \sum_{x \in \Omega} \exp\left(\langle \theta, x \rangle + \sum_{j=1}^{r} \lambda_j \langle v_j, x \rangle^2\right), \qquad (4)$$

where $\lambda_j$ and $v_j$ denote the $j$-th largest non-zero eigenvalue and its corresponding unit eigenvector of $A$, and $r$ denotes the rank of $A$. We note that such a decomposition is always possible because $A$ is a real symmetric matrix, i.e., all eigenvalues are real. However, even with a small rank $r$, a naive computation of $Z$ is still intractable as it is a summation over exponentially many terms. Our main idea is to approximate each $\lambda_j \langle v_j, x \rangle^2$ in (4) by a quantized value, in order to drastically reduce the number of summands. Toward this, we rewrite (4) as

$$Z = \sum_{x \in \Omega} \exp\left(\langle \theta, x \rangle + \sum_{j=1}^{r} \mathrm{sign}(\lambda_j) \langle u_j, x \rangle^2\right),$$

where $\mathrm{sign}(\lambda_j) \in \{-1, 1\}$ denotes the sign of $\lambda_j$ and $u_j = \sqrt{|\lambda_j|}\, v_j$. Here we deliberately choose some mapping $f_j : \Omega \to \mathbb{Z}$ (explicitly described in Section 2.2) so that $c \cdot f_j(x) \approx \langle u_j, x \rangle$ for some fixed constant $c > 0$, and hence $Z$ can be nicely approximated as

$$Z \approx \sum_{x \in \Omega} \exp\left(\langle \theta, x \rangle\right) \exp\left(\sum_{j=1}^{r} \mathrm{sign}(\lambda_j)\, (c \cdot f_j(x))^2\right). \qquad (5)$$

Note that $c$ determines the quantization interval and $c \cdot f_j(x)$ represents a quantized value of $\langle u_j, x \rangle$. Namely, for each $x \in \Omega$, we design $f(x) = [f_j(x)]_{j=1}^{r} \in \mathbb{Z}^r$ to approximate $\langle u_j, x \rangle$ for all $j$. Given such $f$, we further process (5) as

$$\sum_{x \in \Omega} \exp(\langle \theta, x \rangle) \exp\left(\sum_{j=1}^{r} \mathrm{sign}(\lambda_j)(c \cdot f_j(x))^2\right) = \sum_{k \in f(\Omega)} \sum_{x \in f^{-1}(k)} \exp(\langle \theta, x \rangle) \exp\left(\sum_{j=1}^{r} \mathrm{sign}(\lambda_j)(c \cdot k_j)^2\right) = \sum_{k \in f(\Omega)} t(k) \exp\left(\sum_{j=1}^{r} \mathrm{sign}(\lambda_j)(c \cdot k_j)^2\right). \qquad (6)$$

In the above, the first equality comes from replacing the summation over $\Omega$ by that over $f(\Omega)$, i.e., for $k = [k_j]_{j=1}^{r} \in f(\Omega)$, each $k_j$ represents a possible value of $f_j(x)$. For the second equality, we define $t(k) := \sum_{x \in f^{-1}(k)} \exp(\langle \theta, x \rangle)$. Finally, from (5) and (6), one can observe that if $t(k)$ is easy to compute and the cardinality of $f(\Omega)$ is small, then the partition function $Z$ can be efficiently approximated. In the following section, we provide more details on how to choose $f$ to achieve the desired properties.
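As a sanity check, the reformulation in (4) is just the eigendecomposition of the quadratic form; the following brute-force sketch (our illustration in Python, not from the paper) verifies it on a tiny instance:

```python
import numpy as np
from itertools import product

# Verify the identity behind (4): x^T A x = sum_j lambda_j <v_j, x>^2 for every
# x in {-1, 1}^n, where (lambda_j, v_j) eigendecompose the symmetric matrix A.
rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))
A = (A + A.T) / 2                 # symmetrize
np.fill_diagonal(A, 0)            # zero diagonal, as in (2)
lams, V = np.linalg.eigh(A)       # columns of V are unit eigenvectors
for x in product([-1.0, 1.0], repeat=n):
    x = np.array(x)
    assert np.isclose(x @ A @ x, np.sum(lams * (V.T @ x) ** 2))
```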
2.2. How to Choose f and Compute t(k)

Choice of f. A naive choice of $f$ would be

$$f_j(x) = \arg\min_{k_j \in \mathbb{Z}} \left|c \cdot k_j - \langle u_j, x \rangle\right| \qquad (7)$$

for all $j$. However, with this choice of $f$, it is unclear how to compute $t(k)$ efficiently (in polynomial time). To address the issue, we propose a recursive construction of $f$ by relaxing (7): we iteratively define $f(x)$ for $x \in S_i \setminus S_{i-1}$, where $S_i := \{x \in \Omega : x_\ell = -1,\ \forall \ell > i\}$, so that

$$\{(-1, \dots, -1)\} = S_0 \subset S_1 \subset \cdots \subset S_{n-1} \subset S_n = \Omega.$$

First, we define $f$ on $S_0$ following (7):

$$f_j((-1, \dots, -1)) := \arg\min_{k_j \in \mathbb{Z}} \left|c \cdot k_j + \sum_{i=1}^{n} u_{ji}\right| \qquad (8)$$

for all $j$. The construction of $f$ on the rest $S_n \setminus S_0$ is done recursively. Suppose that $f(x)$ is defined for all $x \in S_{i-1}$. Then, we define $f$ for $x \in S_i \setminus S_{i-1}$ as follows:

$$f_j(x) := f_j(x') + \widehat{u}_{ji}, \qquad (9)$$

where $\widehat{u}_{ji} := \arg\min_{k \in \mathbb{Z}} |c \cdot k - 2u_{ji}|$ and $x' \in S_{i-1}$ is such that $x'_\ell = x_\ell$ except for $\ell = i$, i.e., $x'_i = -1$. Here, (9) is motivated by the following approximation: $c \cdot f_j(x') \approx \langle u_j, x' \rangle$, and the definition of $\widehat{u}_{ji}$ implies that

$$c \cdot \left(f_j(x') + \widehat{u}_{ji}\right) \approx \langle u_j, x' \rangle + 2u_{ji} = \langle u_j, x \rangle,$$

where the equality is due to $x_i = 1$ and $x'_i = -1$. In essence, we have constructed $f$ via dynamic programming to approximate (7), which allows us to compute $t(k)$ efficiently. Furthermore, our choice of $f$ ensures that $|f(\Omega)|$ is bounded. Before describing how to compute $t(k)$, let us discuss the bound on $|f(\Omega)|$. Instead of characterizing $|f(\Omega)|$ directly, we exhibit a bounded set $B \subset \mathbb{Z}^r$ such that $f(\Omega) \subseteq B$, described explicitly as follows.

Claim 1. $f(\Omega) \subseteq B$, where

$$B = \prod_{j=1}^{r} \{-b_j, -b_j + 1, \dots, b_j - 1, b_j\}, \qquad b_j := \left\lceil \|u_j\|_1 / c + (n+1)/2 \right\rceil.$$

Furthermore, $|B|$ is bounded by $\prod_{j=1}^{r} (2 b_j + 1)$.

We present the proof of Claim 1 in the supplementary material. Finally, given $t(k)$ and $B$ as defined in Claim 1, we approximate the partition function $Z$ as follows (see (5) and (6)):

$$\widehat{Z} = \sum_{k \in B} t(k) \exp\left(\sum_{j=1}^{r} \mathrm{sign}(\lambda_j)(c \cdot k_j)^2\right),$$

where $t(k) = 0$ if $k \notin f(\Omega)$.

Computation of t(k). We are now ready to describe how to compute $t(k)$. Since $t(k) = 0$ for $k \notin B$, it suffices to compute $t(k)$ for all $k \in B$. Similar to the construction of $f$, we recursively compute

$$t_i(k) := \sum_{x \in f^{-1}(k) \cap S_i} \exp(\langle \theta, x \rangle),$$

so that $t_n(k) = t(k)$. The recursive computation of $t_i(k)$ is based on the following claim.

Claim 2. $t_i(k) = t_{i-1}(k) + \exp(2\theta_i)\, t_{i-1}\!\left(k - [\widehat{u}_{ji}]_{j=1}^{r}\right)$.

The proof of Claim 2 is presented in the supplementary material. The claim implies that once $t_{i-1}(k)$ is obtained for all $k \in B$, $t_i(k)$ can be efficiently computed from it, where we set $t_{i-1}(k) = 0$ for $k \notin B$. Initially, one can set

$$t_0(k) = \begin{cases} \exp\left(-\sum_{i=1}^{n} \theta_i\right) & \text{if } k = f((-1, \dots, -1)), \\ 0 & \text{otherwise,} \end{cases}$$

where $f((-1, \dots, -1))$ is defined in (8).

2.3. Provable Guarantee

A succinct description of the proposed approximate inference algorithm of Sections 2.1 and 2.2 is given in Algorithm 1.

Algorithm 1 Spectral inference for low-rank GMs
1: Input: $\theta, \lambda_1, \dots, \lambda_r, v_1, \dots, v_r, c$
2: Output: $\widehat{Z}$
3: $u_j \leftarrow \sqrt{|\lambda_j|}\, v_j$ for all $j \in \{1, \dots, r\}$
4: $b_j \leftarrow \lceil \|u_j\|_1 / c + (n+1)/2 \rceil$ for all $j \in \{1, \dots, r\}$
5: $B \leftarrow \prod_{j=1}^{r} \{-b_j, \dots, b_j\}$
6: $t(k) \leftarrow 0$ for all $k \in B$
7: $\ell = [\ell_j]_{j=1}^{r}$ with $\ell_j \leftarrow \arg\min_{\ell_j \in \mathbb{Z}} |c \cdot \ell_j + \sum_{i=1}^{n} u_{ji}|$
8: $t(\ell) \leftarrow \exp(-\sum_i \theta_i)$
9: for $1 \le i \le n$ do
10:   $\widehat{u}_{ji} \leftarrow \arg\min_{\widehat{u}_{ji} \in \mathbb{Z}} |c \cdot \widehat{u}_{ji} - 2u_{ji}|$ for all $j$
11:   $t'(k) \leftarrow \exp(2\theta_i)\, t(k - [\widehat{u}_{ji}]_{j=1}^{r})$ for all $k \in B$, taking $t(k) = 0$ for $k \notin B$
12:   $t(k) \leftarrow t(k) + t'(k)$ for all $k \in B$
13: end for
14: $\widehat{Z} \leftarrow \sum_{k \in B} t(k) \exp\left(\sum_{j=1}^{r} \mathrm{sign}(\lambda_j)(c \cdot k_j)^2\right)$

We further prove the following theoretical guarantee for the algorithm.

Theorem 3. Algorithm 1 outputs $\widehat{Z}$ such that

$$\left|\log Z - \log \widehat{Z}\right| \le 4rc^2(n+1)^2 + c\left(\sum_{j=1}^{r} \sqrt{|\lambda_j|}\right) n(n+1)$$

in $O\!\left(n\, 2^r \prod_{j=1}^{r} \left(\sqrt{|\lambda_j|}\, n / c + n/2 + 1\right)\right)$ time.

The proof of Theorem 3 is presented in the supplementary material.
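For concreteness, here is a minimal Python sketch of Algorithm 1 under the notation above (our own rendering, not the authors' code; a sparse dictionary stands in for the table $t$ over $B$, which, by Claim 1, the recursion never leaves):

```python
import numpy as np
from collections import defaultdict

def algorithm1(theta, lams, vs, c):
    """Sketch of Algorithm 1. theta: (n,); lams: (r,) nonzero eigenvalues;
    vs: (r, n) array whose rows are unit eigenvectors; c: quantization interval."""
    n, r = len(theta), len(lams)
    us = np.sqrt(np.abs(lams))[:, None] * vs          # u_j = sqrt(|lambda_j|) v_j
    # Lines 7-8: t_0 is supported on f((-1,...,-1)) with mass exp(-sum_i theta_i).
    k0 = tuple(int(np.rint(-us[j].sum() / c)) for j in range(r))
    t = defaultdict(float, {k0: np.exp(-theta.sum())})
    # Lines 9-13, via Claim 2: flipping x_i from -1 to +1 shifts f by [u_hat_ji]_j
    # and multiplies exp(<theta, x>) by exp(2 theta_i).
    for i in range(n):
        du = tuple(int(np.rint(2 * us[j, i] / c)) for j in range(r))
        t_next = defaultdict(float)
        for k, val in t.items():
            t_next[k] += val                           # keep x_i = -1
            t_next[tuple(a + b for a, b in zip(k, du))] += np.exp(2 * theta[i]) * val
        t = t_next
    # Line 14: recombine the quantized quadratic terms.
    signs = np.sign(lams)
    return sum(val * np.exp(signs @ (c * np.array(k)) ** 2) for k, val in t.items())
```

On tiny instances, its output can be checked against brute-force enumeration of (3); the quantization error vanishes as $c \to 0$, in line with Theorem 3.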
As expected, a smaller quantization interval $c$ provides a smaller error bound but a higher complexity (and vice versa). From Theorem 3, given $\varepsilon \in (0, 1/2)$, one can check that Algorithm 1 guarantees $(1 - \varepsilon) Z \le \widehat{Z} \le (1 + \varepsilon) Z$ if we choose

$$c = \min\left\{ \frac{1}{n+1}\sqrt{\frac{\varepsilon}{16 r}},\ \frac{\varepsilon}{4\left(\sum_{j} \sqrt{|\lambda_j|}\right) n(n+1)} \right\}.$$

Under this choice of $c$, the algorithm complexity becomes $O\!\left(\left(9 r \max(\lambda_{\max}, 1)/\varepsilon\right)^{r} n^{2r+1}\right)$, where $\lambda_{\max} = \max_j |\lambda_j|$. Therefore, if the rank and parameters of GM are bounded, i.e., $r, A_{ij} = O(1)$ for all $i, j$, Algorithm 1 is a fully polynomial-time approximation scheme (FPTAS) for approximating $Z$.

Finally, we remark that the following simple trick allows an FPTAS for approximating the partition function of a richer class of GMs: for any diagonal matrix $D$, one can check that

$$Z = Z(\theta, A) = \exp(-\mathrm{Tr}(D))\, Z(\theta, A + D). \qquad (10)$$

Namely, if there exists a diagonal matrix $D$ such that the rank of $A + D$ is $O(1)$ (even though $A$ itself may not be of low rank), then one can run Algorithm 1 to approximate $Z(\theta, A + D)$ and use it to derive $Z(\theta, A)$ from (10).

3. Spectral Inference for High-Rank GMs

In the previous section, we introduced an FPTAS for approximating the partition function of the special class of low-rank GMs. However, for general (high-rank) GMs, Algorithm 1 is intractable to run, as its complexity grows exponentially with respect to the rank. In this section, we address the issue by proposing a new efficient partition function approximation algorithm for general GMs of arbitrary rank, which utilizes Algorithm 1 as a subroutine. Our main idea is to decompose the partition function of GM into a product of those of rank-1 GMs using a mean-field approximation, and then handle each rank-1 GM via Algorithm 1. Throughout this section, we assume GMs with $\theta = 0$. Such a restriction does not harm the generality of our method due to the following identity:

$$\sum_{x \in \Omega = \{-1,1\}^n} \exp\left(\langle \theta, x \rangle + x^T A x\right) = \frac{1}{2} \sum_{x' \in \{-1,1\}^{n+1}} \exp\left((x')^T A' x'\right), \quad \text{where } A' = \begin{bmatrix} A & \frac{1}{2}\theta \\ \frac{1}{2}\theta^T & 0 \end{bmatrix}$$

and $\sum_{x'} \exp((x')^T A' x')$ is the partition function of a GM with coupling matrix $A'$. Namely, computing the partition function of any GM is easily reducible to computing that of an alternative GM with $\theta = 0$.

3.1. Overall Approach: From High-Rank to Low-Rank

To handle high-rank GMs, we first reformulate the partition function $Z$ by substituting the summation over $x$ with an expectation over $x$ drawn from the uniform distribution $U_\Omega$ over $\Omega$:

$$Z = \sum_{x \in \Omega} \exp\left(x^T A x\right) = 2^n\, \mathbb{E}_{x \sim U_\Omega}\left[\exp\left(x^T A x\right)\right]. \qquad (11)$$

Then, to approximate the above expectation, we consider the following mean-field approximation via some fully factorized distribution $q(y) = \prod_{j=1}^{n} q_j(y_j)$, where $y_j = \langle v_j, x \rangle$ and $y = [y_j]_{j=1}^{n}$:

$$\mathbb{E}_{x \sim U_\Omega}\left[\exp\left(x^T A x\right)\right] = \mathbb{E}_{y \sim P_Y}\left[\exp\left(\sum_{j=1}^{n} \lambda_j y_j^2\right)\right] \approx \prod_{j=1}^{n} \mathbb{E}_{y_j \sim q_j}\left[\exp\left(\lambda_j y_j^2\right)\right], \qquad (12)$$

where $P_Y(y) := \sum_{x \in \Omega :\, y_j = \langle v_j, x \rangle\ \forall j} U_\Omega(x)$ for $y \in \mathcal{Y}$ and $\mathcal{Y} := \left\{y = [\langle v_j, x \rangle]_{j=1}^{n} : x \in \Omega\right\}$. Now, we prove the following claim: the choice $q_j(y_j) = P_Y(y_j)$ (the marginal probability of the joint distribution $P_Y$) is optimal for the mean-field approximation in (12) with respect to the Kullback-Leibler (KL) divergence.

Claim 4. $\mathrm{KL}\left(P_Y(y)\,\|\, \prod_{j=1}^{n} q_j(y_j)\right)$ is minimized when $q_j(y_j) = P_Y(y_j)$ for all $j$.

The proof of Claim 4 is presented in the supplementary material. In summary, under the choice $q_j(y_j) = P_Y(y_j)$, we use the following approximation for $Z$ from (11) and (12):

$$Z = 2^n\, \mathbb{E}_{x \sim U_\Omega}\left[\exp\left(x^T A x\right)\right] \approx 2^n \prod_{j=1}^{n} \mathbb{E}_{y_j \sim q_j}\left[\exp\left(\lambda_j y_j^2\right)\right] = 2^n \prod_{j=1}^{n} \mathbb{E}_{x \sim U_\Omega}\left[\exp\left(\lambda_j \langle v_j, x \rangle^2\right)\right], \qquad (13)$$

where it is easy to check that $2^n\, \mathbb{E}_{x \sim U_\Omega}[\exp(\lambda_j \langle v_j, x \rangle^2)]$ is exactly the partition function of the rank-1 GM induced by $\lambda_j, v_j$, and hence can be efficiently approximated using Algorithm 1.
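Both reductions used in this section, the identity (10) and the $\theta = 0$ embedding via $A'$, are elementary to confirm by brute force on a small instance; a sketch (ours, for illustration):

```python
import numpy as np
from itertools import product

def exact_log_Z(theta, A):
    # Brute-force log partition function (3) for a tiny pairwise binary GM.
    xs = map(np.array, product([-1.0, 1.0], repeat=len(theta)))
    vals = np.array([theta @ x + x @ A @ x for x in xs])
    m = vals.max()
    return m + np.log(np.exp(vals - m).sum())

rng = np.random.default_rng(1)
n = 5
A = rng.normal(size=(n, n)); A = (A + A.T) / 2; np.fill_diagonal(A, 0)
theta = rng.uniform(-1, 1, size=n)

# Identity (10): Z(theta, A) = exp(-Tr(D)) Z(theta, A + D) for any diagonal D.
D = np.diag(rng.normal(size=n))
assert np.isclose(exact_log_Z(theta, A), -np.trace(D) + exact_log_Z(theta, A + D))

# Reduction to theta = 0: Z(theta, A) = Z(0, A') / 2 with A' as defined above.
Ap = np.block([[A, theta[:, None] / 2], [theta[None, :] / 2, np.zeros((1, 1))]])
assert np.isclose(exact_log_Z(theta, A),
                  exact_log_Z(np.zeros(n + 1), Ap) - np.log(2))
```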
We further remark that the mean-field approximation quality in (13) is expected to be better when the variables $y_j = \langle v_j, x \rangle$ are closer to being independent. In this regard, it is quite a reasonable approximation: for $i \neq j$, $\langle v_i, x \rangle$ and $\langle v_j, x \rangle$ are pairwise uncorrelated, i.e., $\mathbb{E}_{x \sim U_\Omega}[\langle v_i, x \rangle \langle v_j, x \rangle] = 0$, due to the orthogonality of the eigenvectors $v_i, v_j$. We also remark that our mean-field approximation (13) is different from the traditional one (Parisi, 1988): the latter aims to find a mean-field distribution over the $x_i$'s minimizing the KL divergence to the original distribution $p(x)$, while our approach minimizes the KL divergence between $\prod_j q_j(y_j)$ and $P_Y(y)$, i.e., after spectral pre-processing.

3.2. Improving (13) via Controlling the Diagonal of A

It is instructive to recall that varying the diagonal of $A$ only changes the partition function by a constant multiplicative factor, as in (10). In order to fully utilize this, we aim to optimize the diagonal of $A$ to improve our mean-field approximation. To this end, we build the following mean-field approximation by introducing the additional freedom of choosing a diagonal matrix $D$:

$$\mathbb{E}_{x \sim U_\Omega}\left[\exp\left(x^T A x\right)\right] = \mathbb{E}_{x \sim U_\Omega}\left[\exp\left(x^T (A + D) x - \mathrm{Tr}(D)\right)\right] = \mathbb{E}_{y^D \sim P_{Y^D}}\left[\exp\left(\sum_{j=1}^{n} \lambda^D_j (y^D_j)^2 - \mathrm{Tr}(D)\right)\right] \approx \exp(-\mathrm{Tr}(D)) \prod_{j=1}^{n} \mathbb{E}_{y^D_j \sim q^D_j}\left[\exp\left(\lambda^D_j (y^D_j)^2\right)\right], \qquad (14)$$

where $\lambda^D_j, y^D, q^D, \mathcal{Y}^D, P_{Y^D}$ are defined for $A + D$ analogously to $\lambda_j, y, q, \mathcal{Y}, P_Y$ for $A$. Since it is intractable to find the optimal choice of $D$ by directly minimizing the approximation gap of (14) (as computing the true expectations is intractable), we propose to set the free parameter $D$ by solving the following semi-definite programming (SDP) problem:

$$\max_D\ \mathrm{Tr}(D) \quad \text{subject to} \quad A + D \preceq 0. \qquad (15)$$

The intuition behind solving (15) is provided in Section 3.3, and its empirical justification through experimental studies in Section 4.2. We remark that the SDP (15) is equivalent to (the dual of) the popular semi-definite relaxation of the max-cut problem (Goemans & Williamson, 1995) and the maximum eigenvalue minimization problem (Delorme & Poljak, 1993). As for the complexity of solving (15), the interior point method (Alizadeh, 1995; Helmberg et al., 1996) has $O(n^{3.5} \log(1/\varepsilon))$ running time and the first-order method (Nesterov, 2007) has $O(n^3 \log n / \varepsilon)$ running time, where $\varepsilon > 0$ denotes the target precision to the optimum; we also refer to Section 3 of (Waldspurger et al., 2015) and Section 4 of (Goemans & Williamson, 1995) for more details. From (11), (14) and (15), our final approximation becomes

$$Z = 2^n\, \mathbb{E}_{x \sim U_\Omega}\left[\exp\left(x^T A x\right)\right] \approx 2^n \exp(-\mathrm{Tr}(D)) \prod_{j=1}^{n} \mathbb{E}_{y^D_j \sim q^D_j}\left[\exp\left(\lambda^D_j (y^D_j)^2\right)\right] = 2^n \exp(-\mathrm{Tr}(D)) \prod_{j=1}^{n} \mathbb{E}_{x \sim U_\Omega}\left[\exp\left(\lambda^D_j \langle v^D_j, x \rangle^2\right)\right],$$

where $D$ is a solution of (15) and $v^D_j$ is the unit eigenvector of $A + D$ corresponding to $\lambda^D_j$. Clearly, the above approximation with $D = 0$ reduces to (13). Finally, we formally state the proposed algorithm in Algorithm 2.
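Assembling the pieces, a compact Python sketch of what Algorithm 2 computes, reusing the `algorithm1` sketch from Section 2.3 and taking the diagonal of a feasible solution of (15) as input (names and structure are ours; overflow handling is omitted):

```python
import numpy as np

def algorithm2(A, D_diag=None, c_scale=1e-3):
    """Sketch of Algorithm 2 for a GM with theta = 0 (see the reduction above).

    D_diag: diagonal entries of a feasible solution D of the SDP (15);
            None (i.e., D = 0) falls back to the plain approximation (13).
    Returns an estimate of log Z."""
    n = A.shape[0]
    d = np.zeros(n) if D_diag is None else np.asarray(D_diag, dtype=float)
    lams, V = np.linalg.eigh(A + np.diag(d))    # spectrum of A + D
    log_Z = n * np.log(2) - d.sum()             # the 2^n exp(-Tr(D)) prefactor
    for lam, v in zip(lams, V.T):               # rows of V.T are unit eigenvectors
        if abs(lam) < 1e-12:
            continue                            # zero eigenvalues contribute E_j = 1
        c = np.sqrt(abs(lam)) * c_scale         # c = sqrt(|lambda_j|)/1000 (Section 4.1)
        Z_j = algorithm1(np.zeros(n), np.array([lam]), v[None, :], c)
        log_Z += np.log(Z_j) - n * np.log(2)    # E_j = 2^{-n} Z_j, a rank-1 partition fn
    return log_Z
```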
Algorithm 2 Spectral inference for high-rank GMs
1: Input: $A, c$
2: Output: $\widehat{Z}$
3: $D \leftarrow$ solution of the semi-definite programming (15)
4: for $1 \le j \le n$ do
5:   $\lambda^D_j \leftarrow$ $j$-th largest eigenvalue of $A + D$
6:   $v^D_j \leftarrow$ unit eigenvector of $A + D$ corresponding to $\lambda^D_j$
7:   $E_j \leftarrow 2^{-n} \cdot$ Algorithm 1($\theta = 0, \lambda^D_j, v^D_j, c$)
8: end for
9: $\widehat{Z} \leftarrow 2^n \exp(-\mathrm{Tr}(D)) \prod_{j=1}^{n} E_j$

3.3. Intuition for (15)

We now describe the intuition behind the semi-definite programming (15). To this end, let us rewrite the approximation error in (14) from the following alternative view:

$$\log(Z) - \log(\widehat{Z}) = \log \frac{\mathbb{E}_{y^D \sim P_{Y^D}}\left[\exp\left(\sum_{j=1}^{n} \lambda^D_j (y^D_j)^2 - \mathrm{Tr}(D)\right)\right]}{\mathbb{E}_{y^D \sim q^D}\left[\exp\left(\sum_{j=1}^{n} \lambda^D_j (y^D_j)^2 - \mathrm{Tr}(D)\right)\right]} = \log \frac{\mathbb{E}_{y^D \sim P_{Y^D}}\left[\exp\left(\sum_{j=1}^{n} \lambda^D_j (y^D_j)^2\right)\right]}{\mathbb{E}_{y^D \sim q^D}\left[\exp\left(\sum_{j=1}^{n} \lambda^D_j (y^D_j)^2\right)\right]},$$

where $\widehat{Z}$ denotes the approximated partition function. One can easily check that the approximation error is 0 when $\lambda^D_1 = \cdots = \lambda^D_n = 0$. Thus, we can expect a very accurate estimation when all eigenvalues of $A + D$ are close to 0. One can also observe that if there exists $\lambda^D_j > 0$, then the error might be huge, as $\sup_{y \in \mathbb{R}^n} \sum_{j=1}^{n} \lambda^D_j y_j^2 = \infty$ and the supports of $P_{Y^D}$ and $q^D$ differ. Under these intuitions, we suggest solving the following problem:

$$\max_D\ \sum_{j=1}^{n} \lambda^D_j \quad \text{subject to} \quad \lambda^D_j \le 0 \ \text{for all } j. \qquad (16)$$

The optimization (16) is equivalent to (15), since $\mathrm{Tr}(D) = \sum_{j=1}^{n} \lambda^D_j - \mathrm{Tr}(A)$ and the condition $\lambda^D_j \le 0$ for all $j$ is equivalent to $A + D \preceq 0$.

4. Experimental Results

In this section, we evaluate our algorithms in diverse environments, including both synthetic and UAI datasets, to corroborate our theorem and claims.

4.1. Setups

To begin with, we describe our overall experimental settings. We compare our algorithms against the standard inference schemes dominantly used in most applications: belief propagation (BP) (Pearl, 1982), mean-field approximation (MF) (Parisi, 1988), mini-bucket elimination (MBE) (Dechter & Rish, 2003) and weighted mini-bucket elimination (WMBE) (Liu & Ihler, 2011). Since all baselines are iterative methods with a trade-off between computation cost and performance, we choose 200 iterations for BP, 1000 iterations for MF and ibound 10 for MBE and WMBE, for fair comparison; below, these are referred to as BP-200, MF-1000, MBE-10 and WMBE-10, respectively. In the case of BP and MF, their performance saturates with the above choices in most cases, and there is no gain from running more iterations. On the other hand, one can improve the approximation quality of MBE and WMBE with a larger ibound, but their complexity grows exponentially with respect to it. We also report the running times of the algorithms in our implementation, using round brackets following their names, e.g., BP-200 (2s) means that 200 iterations of BP run in 2 seconds (on average) for the tested GMs.

[Figure 2. Evaluation to measure (a) the effect of different choices of D (average of |log Z - log Ẑ| vs. coupling strength) and (b) the running time for solving the semi-definite programming (15) (average running time in seconds vs. number of vertices, with quadratic (n²) and cubic (n³) regression curves fitted to the 100-2000 vertex results), and (c) approximation errors for rank-1 GMs of Algorithm 1 (<1s) against BP-200 (<1s), MF-1000 (<1s), MBE-10 (<1s) and WMBE-10 (3.79s).]

Throughout all experiments, we fix $c = \sqrt{|\lambda_j|}/1000$ for Algorithm 1 and Algorithm 2 to bound their running time regardless of the eigenvalues. For solving the semi-definite programming (SDP) (15), we use CVX (Grant et al., 2008) with the SDPT3 solver (Toh et al., 1999) in MATLAB.
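For reference, the same SDP can be posed in a few lines with cvxpy (a sketch of ours; the paper's experiments use the MATLAB toolchain above):

```python
import cvxpy as cp

def solve_sdp_15(A):
    # SDP (15): maximize Tr(D) over diagonal matrices D subject to A + D being
    # negative semidefinite (cvxpy writes this constraint as "<< 0").
    n = A.shape[0]
    d = cp.Variable(n)                       # diagonal entries of D
    problem = cp.Problem(cp.Maximize(cp.sum(d)), [A + cp.diag(d) << 0])
    problem.solve()                          # any conic solver with SDP support works
    return d.value
```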
For generating the synthetic GMs we evaluate on, we first choose the graph structure (specified in each setting) and randomly sample $\theta_i \sim \mathrm{Unif}[-1, 1]$ on its vertices and $A_{ij} \sim \mathrm{Unif}[-s, s]$ on its edges, where Unif denotes the uniform distribution and $s$ indicates the strength of the pairwise couplings. For measuring running time in all experiments, we run the algorithms on a single CPU thread. To reduce experimental noise, we average over 100 random GMs for each plot unless otherwise stated.

4.2. Investigating the Semi-Definite Programming (15)

In this section, we investigate the empirical effect and running time of the proposed SDP (15).

Effect of solving (15). We first investigate how (15) helps the mean-field approximation (14) used in Algorithm 2, compared to other choices of the diagonal matrix $D$. In particular, we consider three alternatives. The first choice is $D = 0$, which does not change the diagonal of $A$. The second choice is $D = -(\max_j \lambda_j)\, I$, which sets the entries of $D$ by the maximum eigenvalue of $A$ so that $A + D \preceq 0$. The last choice is $D_{ii} = -\sum_{j=1}^{n} |A_{ij}|$, which forces $A + D$ to be a diagonally dominant matrix with non-positive diagonal, i.e., $A + D \preceq 0$. The second and third choices can be thought of as feasible, yet non-optimal, solutions of (15). Figure 2a reports the log-partition error for GMs on the complete graph with 20 vertices. One can observe that solving (15) is important for the approximation performance of Algorithm 2.

Running time for solving (15). Next, we discuss the empirical complexity of solving (15). Our solver SDPT3 uses the primal-dual interior point method (Toh et al., 1999). To measure the running time of the solver, we generate random GMs on complete graphs, varying the number of vertices from 100 to 5000. Figure 2b illustrates the average running time of the solver, where each point is averaged over 10 random GMs. Comparing against quadratic and cubic polynomials in $n$, one can observe that the empirical running time for solving (15) lies between $O(n^2)$ and $O(n^3)$, which is better than the theoretical bound $O(n^{3.5})$ of the interior point method (Helmberg et al., 1996).

4.3. Evaluation of Algorithm 1 under Low-Rank GMs

We evaluate Algorithm 1 under rank-1 GMs, as it is used as a subroutine of Algorithm 2. We choose a random eigenvalue $\lambda \in \{-1, 1\}$ and a random eigenvector $v \sim \mathrm{Unif}\{v \in \mathbb{R}^n : \|v\|_2 = 1\}$ to generate rank-1 GMs by setting $A = \lambda v v^T$ and $\theta_i \sim \mathrm{Unif}[-1, 1]$. Given $v$, we rescale $\lambda$ so that the average value of $|A_{ij}|$ matches a given constant $s$ (the coupling strength in Figure 2c), i.e.,

$$\frac{1}{n(n-1)} \sum_{i \neq j} |A_{ij}| = s.$$

We remark that rank-1 GMs have the special property that if the eigenvalue $\lambda$ is positive (resp. negative), they are equivalent to ferromagnetic (resp. antiferromagnetic) models, i.e., $A_{ij} \ge 0$ (resp. $A_{ij} \le 0$) for $i \neq j$. Figure 2c reports the algorithm performances under rank-1 GMs. As expected from our theoretical results (Theorem 3), our algorithm is nearly exact, while the other algorithms fail. In particular, BP, MBE and WMBE output very poor approximations, since they usually fail in antiferromagnetic cases, i.e., for negative eigenvalues. The superior performance of Algorithm 1 under rank-1 GMs implies that the approximation error of Algorithm 2 mainly comes from the mean-field approximation (14).
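A sketch of the rank-1 construction just described (our code, following the description above):

```python
import numpy as np

def rank1_gm(n, s, rng):
    # A = lambda * v v^T with a random sign and a random unit eigenvector; then
    # rescale so that the average off-diagonal |A_ij| equals the coupling strength s.
    lam = rng.choice([-1.0, 1.0])
    v = rng.normal(size=n)
    v /= np.linalg.norm(v)
    A = lam * np.outer(v, v)                           # exactly rank-1
    A *= s / np.abs(A[~np.eye(n, dtype=bool)]).mean()  # match E[|A_ij|] = s for i != j
    theta = rng.uniform(-1.0, 1.0, size=n)             # vertex parameters
    return theta, A            # note: diag(A) only shifts log Z by a constant, cf. (10)
```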
4.4. Evaluation of Algorithm 2 under High-Rank GMs

We now evaluate the empirical performance of Algorithm 2 under synthetic high-rank GMs and the UAI datasets (Gogate, 2014).

[Figure 3. Approximation errors for GMs on various graph structures and UAI datasets: (a) ER graph of 20 vertices, p = 0.5; (b) ER graph of 20 vertices, p = 0.7; (c) complete graph of 20 vertices; (d) complete bipartite graph of 20 vertices; (e) grid graph of 225 vertices (15 x 15); (f) UAI grid GMs. In (a)-(b), p denotes the probability of choosing each edge of the ER graph. In (f), indices 1-4 correspond to GMs of 100 vertices (10 x 10 grid), indices 5-8 to GMs of 400 vertices (20 x 20 grid), and the average running times of the algorithms for indices 1-4 and 5-8 are noted in round brackets, respectively.]

In all cases, we have checked through simulations that BP and MF do not achieve better accuracy than BP-200 and MF-1000, respectively, even when run for many more iterations. To generate synthetic GMs, we consider Erdős-Rényi (ER) random graphs, complete bipartite graphs, complete graphs, and grid graphs. The experimental results are reported in Figures 3a-3e. In all cases, one can observe that our algorithm significantly outperforms the others in the high-coupling region, i.e., the low-temperature regime. It is known that MF outputs better approximations than other methods as the underlying graph structure becomes dense, e.g., on complete graphs (Ellis & Newman, 1978); remarkably, our algorithm performs better than MF even in such cases. In particular, MF and BP exhibit high variance in their approximation errors in the high-coupling regions, while ours does not.

We also evaluate our algorithms on GMs over grid graphs from the dataset of the UAI 2014 inference competition, which provides 8 GMs on grid graphs: 4 of them with 100 vertices (10 x 10) and the other 4 with 400 vertices (20 x 20). Figure 3f reports the approximation error and the running time of each algorithm. In these experiments, our algorithm consistently attains small errors, while the other algorithms often fail badly.

Finally, we compare the running times of the algorithms under GMs on complete graphs of 100-500 vertices, reported in Figure 4. Here, we do not report WMBE since it is slower than MBE. One can observe that Algorithm 2 scales as well as BP, while MBE does not. MF is the fastest, but it is the worst in approximation quality under grid and UAI GMs, as reported in the earlier experiments.

[Figure 4. Average running time under varying sizes of GMs on complete graphs (BP-200, MF-1000, MBE-10 and Algorithm 2; 100-500 vertices).]

5. Conclusion

In this paper, we provide a completely new angle for designing approximate inference algorithms for graphical models.
The proposed algorithms scale well to large models, like prior iterative message-passing schemes, and outperform them in approximation quality, particularly and significantly for hard instances. For future work, we plan to extend our spectral approach to estimating marginal distributions and/or related inference tasks in higher-order or continuous models.

Acknowledgements

This work was supported by the IITP grant funded by the Korea government (MSIT) (No. 2017-0-01779, XAI). We would like to acknowledge Sungsoo Ahn for helpful discussions and for sharing code.

References

Ahn, S., Chertkov, M., Shin, J., and Weller, A. Gauged mini-bucket elimination for approximate inference. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 10-19, 2018a.

Ahn, S., Chertkov, M., Weller, A., and Shin, J. Bucket renormalization for approximate inference. In International Conference on Machine Learning (ICML), pp. 109-118, 2018b.

Alizadeh, F. Interior point methods in semidefinite programming with applications to combinatorial optimization. SIAM Journal on Optimization, 5(1):13-51, 1995.

Bandyopadhyay, A. and Gamarnik, D. Counting without sampling: Asymptotics of the log-partition function for certain statistical physics models. Random Structures & Algorithms, 33(4):452-479, 2008.

Bilmes, J. A. Graphical models and automatic speech recognition. In Mathematical Foundations of Speech and Language Processing, pp. 191-245. Springer, 2004.

Dechter, R. Bucket elimination: A unifying framework for reasoning. Artificial Intelligence, 113(1-2):41-85, 1999.

Dechter, R. and Rish, I. Mini-buckets: A general scheme for bounded inference. Journal of the ACM (JACM), 50(2):107-153, 2003.

Delorme, C. and Poljak, S. Laplacian eigenvalues and the maximum cut problem. Mathematical Programming, 62(1-3):557-574, 1993.

Efthymiou, C., Hayes, T. P., Štefankovič, D., Vigoda, E., and Yin, Y. Convergence of MCMC and loopy BP in the tree uniqueness region for the hard-core model. In IEEE Annual Symposium on Foundations of Computer Science (FOCS), pp. 704-713. IEEE, 2016.

Ellis, R. S. and Newman, C. M. The statistics of Curie-Weiss models. Journal of Statistical Physics, 19(2):149-161, 1978.

Freeman, W. T., Pasztor, E. C., and Carmichael, O. T. Learning low-level vision. International Journal of Computer Vision, 40(1):25-47, 2000.

Galanis, A., Štefankovič, D., and Vigoda, E. Inapproximability for antiferromagnetic spin systems in the tree non-uniqueness region. In Annual ACM SIGACT Symposium on Theory of Computing (STOC), pp. 823-831. ACM, 2014.

Geman, S. and Geman, D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 6(6):721-741, 1984.

Goemans, M. X. and Williamson, D. P. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM), 42(6):1115-1145, 1995.

Gogate, V. UAI 2014 inference competition. See www.hlt.utdallas.edu/vgogate/uai14-competition, 2014.

Grant, M., Boyd, S., and Ye, Y. CVX: Matlab software for disciplined convex programming, 2008.

Helmberg, C., Rendl, F., Vanderbei, R. J., and Wolkowicz, H. An interior-point method for semidefinite programming. SIAM Journal on Optimization, 6(2):342-361, 1996.

Jain, V., Koehler, F., and Mossel, E. The mean-field approximation: Information inequalities, algorithms, and complexity. In Conference on Learning Theory (COLT), pp. 1326-1347, 2018.
Jerrum, M. and Sinclair, A. Polynomial-time approximation algorithms for the Ising model. SIAM Journal on Computing, 22(5):1087-1116, 1993.

Jerrum, M., Sinclair, A., and Vigoda, E. A polynomial-time approximation algorithm for the permanent of a matrix with nonnegative entries. Journal of the ACM (JACM), 51(4):671-697, 2004.

Li, L., Lu, P., and Yin, Y. Correlation decay up to uniqueness in spin systems. In Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 67-84. Society for Industrial and Applied Mathematics, 2013.

Liu, J., Sinclair, A., and Srivastava, P. The Ising partition function: Zeros and deterministic approximation. In IEEE Annual Symposium on Foundations of Computer Science (FOCS), pp. 986-997. IEEE, 2017.

Liu, Q. and Ihler, A. T. Bounding the partition function using Hölder's inequality. In International Conference on Machine Learning (ICML), pp. 849-856, 2011.

Molkaraie, M. An importance sampling algorithm for the Ising model with strong couplings. In 24th International Zurich Seminar on Communications (IZS). ETH-Zürich, 2016.

Molkaraie, M. and Gómez, V. Monte Carlo methods for the ferromagnetic Potts model using factor graph duality. IEEE Transactions on Information Theory, 64(12):7449-7464, 2018.

Neal, R. M. Annealed importance sampling. Statistics and Computing, 11(2):125-139, 2001.

Nesterov, Y. Smoothing technique and its applications in semidefinite optimization. Mathematical Programming, 110(2):245-259, 2007.

Parisi, G. Statistical Field Theory. Addison-Wesley, 1988.

Patel, V. and Regts, G. Deterministic polynomial-time approximation algorithms for partition functions and graph polynomials. SIAM Journal on Computing, 46(6):1893-1919, 2017.

Pearl, J. Reverend Bayes on inference engines: A distributed hierarchical approach. In Proceedings of the Second AAAI Conference on Artificial Intelligence, pp. 133-136. AAAI Press, 1982.

Salakhutdinov, R. and Larochelle, H. Efficient learning of deep Boltzmann machines. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 693-700, 2010.

Sinclair, A., Srivastava, P., and Thurley, M. Approximation algorithms for two-state anti-ferromagnetic spin systems on bounded degree graphs. Journal of Statistical Physics, 155(4):666-686, 2014.

Sly, A. and Sun, N. The computational hardness of counting in two-spin models on d-regular graphs. In IEEE Annual Symposium on Foundations of Computer Science (FOCS), pp. 361-369. IEEE, 2012.

Sykes, M., Essam, J., and Gaunt, D. Derivation of low-temperature expansions for the Ising model of a ferromagnet and an antiferromagnet. Journal of Mathematical Physics, 6(2):283-298, 1965.

Temperley, H. N. and Fisher, M. E. Dimer problem in statistical mechanics: an exact result. Philosophical Magazine, 6(68):1061-1063, 1961.

Toh, K.-C., Todd, M. J., and Tütüncü, R. H. SDPT3: a MATLAB software package for semidefinite programming, version 1.3. Optimization Methods and Software, 11(1-4):545-581, 1999.

Wainwright, M. J., Jaakkola, T. S., and Willsky, A. S. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7):2313-2335, 2005.

Waldspurger, I., d'Aspremont, A., and Mallat, S. Phase recovery, MaxCut and complex semidefinite programming. Mathematical Programming, 149(1-2):47-81, 2015.

Weitz, D. Counting independent sets up to the tree threshold. In Annual ACM SIGACT Symposium on Theory of Computing (STOC), pp. 140-149. ACM, 2006.
Wrigley, A., Lee, W. S., and Ye, N. Tensor belief propagation. In International Conference on Machine Learning (ICML), pp. 3771-3779, 2017.

Xue, Y., Ermon, S., Le Bras, R., Selman, B., et al. Variable elimination in the Fourier domain. In International Conference on Machine Learning (ICML), pp. 285-294, 2016.

Zhang, J., Liang, H., and Bai, F. Approximating partition functions of the two-state spin system. Information Processing Letters, 111(14):702-710, 2011.