Stochastic Discrete Clenshaw-Curtis Quadrature

Nico Piatkowski NICO.PIATKOWSKI@TU-DORTMUND.DE
Artificial Intelligence Group, TU Dortmund, Germany

Katharina Morik KATHARINA.MORIK@TU-DORTMUND.DE
Artificial Intelligence Group, TU Dortmund, Germany

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

Abstract

The partition function is fundamental for probabilistic graphical models; it is required for inference, parameter estimation, and model selection. Evaluating this function corresponds to discrete integration, namely a weighted sum over an exponentially large set. This task quickly becomes intractable as the dimensionality of the problem increases. We propose an approximation scheme that, for any discrete graphical model whose parameter vector has bounded norm, estimates the partition function with arbitrarily small error. Our algorithm relies on a near minimax optimal polynomial approximation to the potential function and a Clenshaw-Curtis style quadrature. Furthermore, we show that this algorithm can be randomized to split the computation into a high-complexity part and a low-complexity part, where the latter may be carried out on small computational devices. Experiments confirm that the new randomized algorithm is highly accurate if the parameter norm is small, and is otherwise comparable to methods with unbounded error.

1. Introduction

Graphical models serve as the underlying framework of various machine learning techniques and facilitate important real-world applications in computer vision, computational biology, signal processing and natural language processing, to mention just a few (Pearl, 1988; Lauritzen, 1996; Koller & Friedman, 2009). When using these models, a central problem is that of computing the partition function, since it is required for computing probabilities from the model and it plays a fundamental role in marginal inference (Jaakkola & Jordan, 1996), maximum-likelihood and maximum-a-posteriori parameter estimation (Wainwright & Jordan, 2008), and model selection (Cover & Thomas, 2006; Grünwald, 2007).

Unfortunately, exact computation of the partition function is an intractable (#P-complete) problem for many graphical models of interest, such as those involving a non-tree conditional independence structure (short: structure) or high-order cliques/factors. Moreover, the chances of finding polynomial-time algorithms for these tasks are rather low, unless P = NP. For some special cases, polynomial-time algorithms are known (Schraudolph & Kamenetsky, 2008; Goldberg & Jerrum, 2013). However, significant research efforts have been conducted in order to derive approximations for more general cases. There are two well established general approaches to approximate large discrete sums: variational methods and sampling. Methods which provide reasonably tight error bounds that hold for any structure are either not available or require solving a series of hard combinatorial optimization problems (Ermon et al., 2013). Based on work from statistical physics, variational methods (Wainwright & Jordan, 2008) are often very fast but do not provide quality guarantees. Loopy belief propagation (LBP) (Pearl, 1988; Kschischang et al., 2001) is well known for computing locally consistent solutions. Nevertheless, this procedure might not converge, and if it does, the result is a local minimum of a surrogate objective (the Bethe free energy).
Tree-reweighted approaches (Wainwright & Jordan, 2008) deliver the tightest upper bound, but the exact error cannot be quantified. In addition, the approximation error depends on the structure of the corresponding probability density. In contrast, the worst-case error of our method is bounded and independent of the structure. Sampling-based techniques are popular, but they suffer from similar issues because the number of samples required to obtain a statistically reliable estimate for multi-dimensional random variables grows exponentially fast. Markov chain Monte Carlo (MCMC) methods are asymptotically accurate, but guarantees exist only for certain special cases. Their performance depends crucially on the choice of the proposal distribution, which often must be domain-specific and expert-designed (Girolami & Calderhead, 2011). Importance sampling methods (Liu et al., 2015) try to correct for biased proposal distributions, but their error is hard to quantify as well.

In contrast to existing approaches, our method is based on a numerical approximation technique called quadrature. It allows us to derive a scalable approximation procedure for partition functions that is applicable to any structure and delivers guarantees on the approximation error. Surprisingly, quadrature-based approximations do not appear in the machine learning literature. The technically most related article is a recent method to compute the log-determinant of large matrices (Han et al., 2015), based on Chebyshev polynomials. Our main contributions can be summarized as follows.

We present a new deterministic algorithm called Discrete Clenshaw-Curtis Quadrature (DCCQ). It gets a structure G, a parameter vector θ ∈ R^d and a polynomial degree k as inputs, and outputs Ẑ_k(θ) such that |Z(θ) − Ẑ_k(θ)| ≤ (ε/2) Z(θ) in time O(k² n^{2k}) whenever k ∈ ω(‖θ‖₂ − ln ε).

Moreover, we show how to split the computation of our approximation into an expensive part, which depends on the structure G and the state space X, and a cheap part, which depends on the model parameters θ. If the target platform has limited computational resources (like a mobile phone or a sensor), the expensive computation may be carried out on a server. Since the expensive part is independent of the model parameters θ, it can be re-used for any repeated computation of the partition function, a setting that is typical in iterative parameter estimation procedures. To this end, we present a randomized algorithm called Stochastic Discrete Clenshaw-Curtis Quadrature (SDCCQ). It gets G, θ, k and m as inputs and outputs Z^m_k(θ) such that P[|Z(θ) − Z^m_k(θ)| ≥ ε Z(θ)] ≤ ζ in time O(k² n^{2k}) + O(k² m_max), whenever k ∈ ω(‖θ‖₂ − ln ε) and the m_i are large enough.

Numerical experiments on grid structures show that the new SDCCQ delivers highly accurate approximations while not suffering from high runtimes, convergence issues or problems with mixed-type potentials, whenever the norm of the parameter vector is small.

As opposed to existing methods, our new approach (i) allows for a user-specified trade-off between runtime and approximation quality via a single parameter, (ii) can output a bound on its own worst-case error, (iii) has no convergence issues, and (iv) allows us to split the computational complexity and re-use existing results.

2. Notation and background

In this section, the preliminaries of our new approximation to the partition function are explained.
We provide some background on graphical models, followed by a short recap of quadrature rules and polynomial approximation.

2.1. Graphical Models

Consider a multi-variate random variable X where each X_i with 1 ≤ i ≤ n takes values x_i from the space X_i. The concatenation of all n variables yields the random variable X with product state space X = X_1 × X_2 × ⋯ × X_n. We denote assignments to single vertices v ∈ V by x_v = y with y ∈ X_v. For any subset of variables U ⊆ V, X_U = ⨂_{v∈U} X_v is its joint state space. Moreover, for any {v, u, w} ⊆ V, we define vuw := {v, u, w} to simplify notation.

The formalism of exponential families provides a unifying framework for a large variety of probability distributions. In particular, they cover all types of probabilistic graphical models (Lauritzen, 1996; Koller & Friedman, 2009), including Bayesian networks (Pearl, 1988), Markov random fields (Wainwright & Jordan, 2008), conditional random fields (Sutton & McCallum, 2011), logistic regression, latent Dirichlet allocation (Blei et al., 2003) and recent deep models (Ranganath et al., 2015). Probability densities which are members of an exponential family can be written as

P_\theta(x) = \exp(\langle \theta, \phi(x) \rangle - A(\theta)) = \frac{1}{Z(\theta)}\,\psi(x).   (1)

Besides its parameter vector θ ∈ R^d, P_θ consists of two major ingredients, namely the potential ψ(x) = exp(⟨θ, φ(x)⟩) and the partition function Z(θ) = exp(A(θ)).

Let us first have a closer look at the potential. It assigns a positive real number to every possible instance x by mapping it from X into a real vector space via a sufficient statistic (or feature map) φ : X → R^d, where φ encodes the structure of X. We ease the notation by identifying components of X with vertices of a graph G = (V, E). In discrete state space models, the state of variables is encoded by indicator functions:

\phi_{v=y}(x) = 1_{\{x_v = y\}}   (2)
\phi_{vu=yz}(x) = 1_{\{x_v = y\}} 1_{\{x_u = z\}}   (3)
\phi_{U=y}(x) = \prod_{v \in U} 1_{\{x_v = y_v\}}, \quad U \subseteq V   (4)

To exactly represent the probability mass of a multi-variate random variable X, φ(x) has to contain at least one indicator per possible clique assignment for all maximal cliques of a structure G (Clifford, 1990). (Cliques are fully connected subgraphs; a clique is maximal if it is not contained in any other clique.) If C(G) is the set of maximal cliques of G, then φ(x) = (φ_{U=u}(x) : U ∈ C(G), u ∈ X_U).

Sufficiency of φ is declared with respect to θ, i.e., knowledge about x is not required to infer θ once φ(x) is known. This property is particularly useful when exponential families have to be applied in resource-constrained, autonomous, medical or self-quantification devices, because arbitrarily large data sets may be aggregated into a finite dimensional representation. In fact, only members of exponential families have this property (Pitman, 1936). We now identify another property of sufficient statistics which will become crucial for finding an efficient inference procedure:

Definition 1. The sufficient statistic φ : X → R^d is called χ-integrable if χ_φ : [d]^k → R with

\chi_\phi(j) = \int_{X} \prod_{i=1}^{k} \phi(x)_{j_i} \, d\nu(x), \quad k \in \mathbb{N},\ j \in [d]^k,

admits a closed-form expression that is computable in polynomial time. Here, [d]^k is an abbreviation for {1, 2, . . . , d}^k.

In fact, if φ is composed of indicator functions like (2), (3) or (4), it is χ-integrable. We will prove this statement in Section 4. Note, however, that our method is not restricted to binary sufficient statistics or discrete state space models; any χ-integrable φ suffices.

The partition function

Z(\theta) = \int_{X} \psi(x) \, d\nu(x)   (5)

accumulates the potential of every possible instance and ensures normalization of P_θ. It is defined w.r.t. a reference measure ν; for continuous (discrete) state spaces, ν is the Lebesgue (counting) measure.
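For a concrete picture of (1)-(5) in the discrete case, the following minimal sketch (not the authors' implementation; the tiny chain graph, state spaces and helper names are chosen only for illustration) builds the indicator sufficient statistics (2)-(3) for a three-vertex chain and evaluates Z(θ) by brute force over the full state space X.

```python
import itertools
import math

# Minimal sketch: a tiny pairwise model with overcomplete indicator
# sufficient statistics, and the partition function (5) evaluated by
# brute force over the full state space X.

# Chain graph v0 - v1 - v2, each vertex binary: X_v = {0, 1}.
vertices = [0, 1, 2]
edges = [(0, 1), (1, 2)]
states = [0, 1]

# Index the sufficient statistic: one indicator per vertex state (Eq. 2)
# and one per edge state pair (Eq. 3); d is the parameter dimension.
feat_index = {}
for v in vertices:
    for y in states:
        feat_index[("v", v, y)] = len(feat_index)
for (v, u) in edges:
    for y in states:
        for z in states:
            feat_index[("e", v, u, y, z)] = len(feat_index)
d = len(feat_index)

def phi(x):
    """Overcomplete binary sufficient statistic phi(x) in R^d."""
    f = [0.0] * d
    for v in vertices:
        f[feat_index[("v", v, x[v])]] = 1.0
    for (v, u) in edges:
        f[feat_index[("e", v, u, x[v], x[u])]] = 1.0
    return f

def partition_function(theta):
    """Brute-force evaluation of Z(theta) = sum_x exp(<theta, phi(x)>)."""
    total = 0.0
    for x in itertools.product(states, repeat=len(vertices)):
        total += math.exp(sum(t * f for t, f in zip(theta, phi(x))))
    return total

theta = [0.1] * d
print(d, partition_function(theta))
```

Already for this toy model the sum runs over |X| = 8 joint states; the exponential growth of this set is exactly what the quadrature-based approach below avoids.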
If the graph G contains no loops, i.e., is tree-structured, Z(θ) can be evaluated exactly in polynomial time. The same holds for planar Ising models (Schraudolph & Kamenetsky, 2008). If no special property of the underlying structure can be exploited, evaluating Z(θ) is hard, in fact #P-complete (Valiant, 1979; Bulatov & Grohe, 2004). The junction tree (JT) algorithm (Lauritzen & Spiegelhalter, 1988; Wainwright & Jordan, 2008) can be applied to compute the exact value Z(θ) with runtime exponential in the size of the largest clique of the triangulation of the graph G. Nevertheless, several approximate methods arose in the last decades. The Bethe approximation (Bethe, 1935; Wainwright & Jordan, 2008), whose main underlying idea is to treat a general graph like a tree, is maybe the most popular; it is computed by loopy belief propagation (Pearl, 1988; Frey, 2000; Kschischang et al., 2001). Its runtime depends on the number of message passing iterations until convergence. This number is in general unknown and the algorithm might not even converge. If no further assumptions are made, the Bethe approximation delivers an estimate of unknown quality. It has been shown quite recently that in the case of log-supermodular potential functions (Ruozzi, 2012; Weller & Jebara, 2014), the Bethe approximation is a lower bound on the partition function. Also the naive mean field (MF) technique (Weiss, 2001; Wainwright & Jordan, 2008) is known to provide a lower bound on Z(θ) (Weiss, 2001). The best known general upper bound is based on convex combinations of exponential family parameters θ(T) which correspond to spanning trees T ∈ T(G) of the original graph G, known as tree-reweighted belief propagation (TRW) (Wainwright et al., 2005). An overview is provided in Table 1.

Table 1. State-of-the-art methods to compute/approximate the partition function of an undirected model with graph G = (V, E), n = |V|, m = |E|. MCMC-based methods are omitted. Here, I is the number of iterations until convergence. L = max_{v∈V} |X_v| and Δ = max_{v∈V} |N_v| are the largest vertex domain and neighborhood size, respectively. w is the tree-width of G and d is the dimension of the parameter space. k_ε is the polynomial degree (implied by ε) as specified by Theorem 5. m_ζ = max_i m_i is the number of samples (implied by ζ) from the distribution (11) as specified by Theorem 7.

Algorithm | Complexity | Quality
JT (Lauritzen & Spiegelhalter, 1988) | O(L^w) | Exact
MF (Weiss, 2001) | O(I n L^Δ) | Lower bound
LBP (Heskes, 2002; Yedidia et al., 2003) | O(I m L^{2Δ}) | Local minimum of Bethe free energy
TRW (Wainwright et al., 2005) | O(I m L^{2Δ} + m log n) | Upper bound
WISH (Ermon et al., 2013) | O(n ln(n/ζ)) · Time(MAP) | (16, ζ)-approximation
DCCQ (Alg. 1) | O(k_ε² d^{2k_ε}) | ε-approximation (Theorem 5)
SDCCQ (Alg. 2) | O(k_ε² d^{2k_ε}) + O(k_ε² m_ζ) | (ε, ζ)-approximation (Theorem 7)

2.2. Quadrature

All of the existing approaches mentioned above are based on an approximation of the structure. In contrast, we propose a numerical approximation of the integration, based on general quadrature. If integrating a function is not tractable, one has to resort to numerical methods in order to approximate the definite integral. The basic idea of a quadrature rule is to replace the integrand f by an approximation h ≈ f that admits tractable integration. It turns out that choosing h = h_k to be a degree-k Chebyshev polynomial approximation of f has outstanding properties like rapidly decreasing and individually converging coefficients (Gautschi, 1985). The general quadrature procedure can be summarized as

\int_l^u f(x)\,dx \approx \int_l^u h_k(x)\,dx = \sum_{i=0}^{k} w_i f(x_i)   (6)

where x ∈ R, the w_i are certain coefficients and the x_i are certain abscissae in [l, u] (all to be determined) (Mason & Handscomb, 2002). Depending on the choice of interpolation points and different kinds of orthogonality properties, Chebyshev polynomial based quadrature rules are termed Gauss-Chebyshev quadrature, Fejér quadrature or Clenshaw-Curtis quadrature (Clenshaw & Curtis, 1960). We make use of discrete orthogonality properties and initialize our approximation at the zeros of second kind Chebyshev polynomials, which yields a Clenshaw-Curtis quadrature.

Chebyshev Polynomials. In order to construct a quadrature rule that is numerically well behaved, the Chebyshev polynomials T_k(x) are chosen as a basis. The fundamental recurrence relation is a convenient representation:

T_0(x) = 1, \quad T_1(x) = x, \quad T_k(x) = 2x\,T_{k-1}(x) - T_{k-2}(x).   (7)
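A minimal sketch of the recurrence (7), checked against the closed form T_k(x) = cos(k arccos x) on [−1, 1]; the function name is ours and chosen only for illustration.

```python
import math

# Evaluate the Chebyshev polynomials T_k via the three-term recurrence (7)
# and compare against the closed form T_k(x) = cos(k * arccos(x)) on [-1, 1].

def chebyshev_T(k, x):
    """Evaluate T_k(x) using T_0 = 1, T_1 = x, T_k = 2x T_{k-1} - T_{k-2}."""
    if k == 0:
        return 1.0
    t_prev, t_curr = 1.0, x
    for _ in range(2, k + 1):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return t_curr

for k in range(6):
    for x in (-0.9, -0.3, 0.0, 0.5, 1.0):
        assert abs(chebyshev_T(k, x) - math.cos(k * math.acos(x))) < 1e-12
print("recurrence matches cos(k arccos x)")
```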
It can be easily verified that each T_k is a polynomial of degree k. The degree-k Chebyshev interpolation h_k of a continuous function f is

h_k(x) = \sum_{i=0}^{k} c_i T_i(x) \quad \text{with} \quad c_i = \frac{2}{k+1} \sum_{j=1}^{k+1} f(x_j)\,T_i(x_j).   (8)

The interpolation points x_j = cos(jπ/(k+1)) are given by the extrema of the corresponding Chebyshev polynomial, which are, at the same time, the zeros of the Chebyshev polynomials of the second kind. The major distinguishing properties of the T_k are (i) discrete orthogonality, which is required for the derivation of the coefficients (8), and (ii) the fact that 2^{1−k} T_k is the minimax approximation to the zero function on [−1, 1] (Mason & Handscomb, 2002). Let h* ∈ H be an approximation to f on the domain [l, u]; h* is called the minimax approximation to f iff

\forall h \in H : \; \|f - h^*\|_\infty = \sup_{x \in [l,u]} |f(x) - h^*(x)| \le \|f - h\|_\infty.

Chebyshev showed that an outstanding property of any minimax approximation is an oscillating error curve. The coefficients given by (8) result in an approximation to the theoretically optimal (minimax) approximation (Mason & Handscomb, 2002) and allow for a rather fast computation in time O(k log k) via the discrete cosine transformation (DCT). Coefficients which deliver an approximation that is even closer to the optimal one can be obtained by the Remez exchange algorithm (Fraser, 1965). Error functions for DCT and Remez coefficients are shown in Figure 1 (i). Since both approaches deliver high quality approximations, we use DCT because of its superior efficiency. Finally, the coefficients c, which are coefficients of Chebyshev polynomials, can be converted to native coefficients c̃, i.e., coefficients of powers of x, by summing the corresponding coefficients that appear in the Chebyshev polynomials, weighted by c_i. Every Chebyshev interpolation can thus be equivalently expressed as

h_k(x) = \sum_{i=0}^{k} \tilde{c}_i x^i.   (9)

Although general quadrature has long been known, we exploit it here, for the first time, for approximating the partition function.
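To make the coefficient pipeline concrete, here is a minimal sketch of the two steps just described: fitting Chebyshev coefficients to f = exp via discrete orthogonality, and converting them to native (monomial) coefficients as in (9). The node choice below (Chebyshev zeros) is one standard variant and is an assumption of this sketch; the rule (8) above uses the second-kind zeros instead, and the interval is fixed to [−1, 1] for brevity (in general one first maps [l, u] to [−1, 1] affinely).

```python
import math

# Sketch of Eqs. (8)-(9): fit a degree-k Chebyshev interpolant to f = exp
# on [-1, 1], then expand it into monomial ("native") coefficients.

def chebyshev_coefficients(f, k):
    """Coefficients c_0..c_k of the interpolant sum_i c_i T_i(x)."""
    nodes = [math.cos(math.pi * (j + 0.5) / (k + 1)) for j in range(k + 1)]
    c = []
    for i in range(k + 1):
        s = sum(f(x) * math.cos(i * math.acos(x)) for x in nodes)
        c.append((1.0 if i == 0 else 2.0) * s / (k + 1))
    return c, nodes

def to_native(c):
    """Convert Chebyshev coefficients c into monomial coefficients (Eq. 9)."""
    k = len(c) - 1
    # T[i] holds the monomial coefficients of T_i, built via recurrence (7).
    T = [[0.0] * (k + 1) for _ in range(k + 1)]
    T[0][0] = 1.0
    if k >= 1:
        T[1][1] = 1.0
    for i in range(2, k + 1):
        for p in range(i):           # 2x * T_{i-1}
            T[i][p + 1] += 2.0 * T[i - 1][p]
        for p in range(i - 1):       # - T_{i-2}
            T[i][p] -= T[i - 2][p]
    return [sum(c[i] * T[i][p] for i in range(k + 1)) for p in range(k + 1)]

k = 8
c, nodes = chebyshev_coefficients(math.exp, k)
native = to_native(c)
poly = lambda x: sum(a * x ** p for p, a in enumerate(native))
print(max(abs(poly(x) - math.exp(x)) for x in nodes))  # ~ 0 at the nodes
```

The monomial form is what makes the next step work: once h_k is written as a polynomial in r = ⟨θ, φ(x)⟩, the sum over X factorizes into terms that only involve χ_φ.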
3. Algorithm

We start with the intuition behind our algorithm for approximating Z(θ), called Discrete Clenshaw-Curtis Quadrature. Indeed, we aim at approximating (5) by a quadrature rule (6). In contrast to (6), the integration in (5) has to be carried out over the n-dimensional set X. Note, however, that evaluating the potential at x is equivalent to computing exp(r) for r = ⟨θ, φ(x)⟩. Integrating ψ over X is thus equivalent to integrating exp over W = {r ∈ R : r = ⟨θ, φ(x)⟩, x ∈ X}. But without any further assumption on W, the definite integral ∫_W exp(z) dz continues to have no closed-form expression. We hence take h_k to be the Chebyshev interpolation of exp on [l, u] with l = min W and u = max W. Moreover, h_k may be interpreted as an approximation ψ̂_k(x) ≈ ψ(x) over X. Computing max W is NP-hard, since it is equivalent to computing the maximum-a-posteriori assignment of the underlying graphical model. We may use an upper bound on max W instead: remember that C(G) is the set of (maximal) cliques of G. For binary sufficient statistics (2), (3) and (4), it holds for all x that ‖φ(x)‖₂ = √|C(G)|, since each clique is in exactly one state. Therefore, max W ≤ ‖θ‖₂ B, which follows directly from the Cauchy-Schwarz inequality with B = √|C(G)|.

Now, we can approximate ψ on [l, u] by Chebyshev polynomials, and hence:

Z(\theta) = \int_X \psi(x)\,d\nu(x) \approx \int_X \hat{\psi}_k(x)\,d\nu(x) = \int_X h_k(\langle\theta,\phi(x)\rangle)\,d\nu(x) = \sum_{i=0}^{k} \tilde{c}_i \int_X \langle\theta,\phi(x)\rangle^i\,d\nu(x) = \sum_{i=0}^{k} \tilde{c}_i \sum_{j \in [d]^i} \chi_\phi(j) \prod_{l=1}^{i} \theta_{j_l} =: \hat{Z}_k(\theta).   (10)

Evaluating the last line has an asymptotic runtime of O(k² n^{2k}) · Time(χ_φ), which is polynomial in the size of the graph. The procedure is summarized in Algorithm 1. We will investigate in Section 4 which k is required to achieve a reasonable accuracy.

Algorithm 1 DCCQ
Input: θ ∈ R^d, k ∈ N, φ
Output: Approximate partition function Ẑ_k(θ)
1: [l, u] ← interval(θ)
2: c̃ ← coefficients(k, [l, u])   // Eq. (8)
3: Ẑ_k(θ) ← Σ_{i=0}^{k} c̃_i Σ_{j∈[d]^i} χ_φ(j) Π_{l=1}^{i} θ_{j_l}

3.1. Complexity-decoupling via Randomization.

O(k² n^{2k}) is lower than the time that it takes to enumerate the full state space X, but the partition function has to be re-computed every time the model changes. This happens frequently during iterative parameter learning procedures. Moreover, it is well known that the complexity of inference is mainly influenced by the structure (Lauritzen & Spiegelhalter, 1988; Wainwright & Jordan, 2008). So when this structure does not change, is it necessary to redo this costly computation during learning? It turns out that if we combine our Algorithm 1 with a probabilistic sampling scheme, it is possible to decouple the most costly computation from the rest. Surprisingly, the costly part depends only on the structure, while the cheap part combines it with the parameters. To see this, notice that the function χ_φ : [d]^k → R is non-negative, hence

P_\phi(j \mid k) = Q_\phi(k)^{-1} \chi_\phi(j) \quad \text{with} \quad Q_\phi(k) = \sum_{j' \in [d]^k} \chi_\phi(j')   (11)

defines a proper probability mass function (pmf) on [d]^k. Ẑ_k(θ), as given by (10), may now be rewritten as follows:

\hat{Z}_k(\theta) = \sum_{i=0}^{k} \tilde{c}_i \sum_{j \in [d]^i} \chi_\phi(j) \prod_{l=1}^{i} \theta_{j_l} = \sum_{i=0}^{k} \tilde{c}_i\, Q_\phi(i) \sum_{j \in [d]^i} P_\phi(j \mid i) \prod_{l=1}^{i} \theta_{j_l} = \sum_{i=0}^{k} w_i\, \mathbb{E}_J\Big[ \prod_{l=1}^{i} \theta_{J_l} \,\Big|\, i \Big],

where w_i = c̃_i Q_φ(i) and the expectation is taken with respect to the random variable J with pmf P_φ(J = j | i) as defined in (11). The runtime has not yet improved compared to (10). However, notice that the expectation E_J[Π_{l=1}^{i} θ_{J_l} | i] may be approximated via

\hat{\mathbb{E}}_J\Big[ \prod_{l=1}^{i} \theta_{J_l} \,\Big|\, i \Big] = \frac{1}{m} \sum_{r=1}^{m} \prod_{l=1}^{i} \theta_{j^{(r)}_l}

by drawing m samples j^{(1)}, . . . , j^{(m)} from P_φ(J | i). Most importantly, Q_φ(i) will not change as long as the structure G does not change. This procedure is summarized in Algorithm 2.

Algorithm 2 SDCCQ
Input: θ ∈ R^d, k ∈ N, m ∈ N^k, φ
Output: Approximate partition function Z^m_k(θ)
1: [l, u] ← interval(θ)
2: c̃ ← coefficients(k, [l, u])
3: for i = 1 to k do
4:   if Q_φ(i) not cached then
5:     Q_φ(i) ← Σ_{j∈[d]^i} χ_φ(j)
6:     cache Q_φ(i)
7: Z^m_k(θ) ← 0
8: for i = 1 to k do
9:   sum ← 0
10:  for r = 1 to m_i do
11:    sample j ∼ P_φ(J | i)
12:    sum ← sum + Π_{l=1}^{i} θ_{j_l}
13:  Z^m_k(θ) ← Z^m_k(θ) + c̃_i Q_φ(i) (1/m_i) sum

It is clear from line 4 that the Q_φ(i) are only computed once. If they are cached, the runtime is O(k² max_i m_i).
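The decoupling is easy to express in code. The following minimal sketch (not the authors' C++ implementation) covers only the cheap, parameter-dependent part of Algorithm 2, lines 7-13; `sample_index_vector` is a hypothetical stand-in for a sampler that draws j ∼ P_φ(· | i), which in the real pipeline would be driven by the cached χ_φ values computed once per structure on a larger machine.

```python
import random

# Cheap part of Algorithm 2: combine cached constants Q[i] = Q_phi(i),
# native coefficients c_tilde and Monte Carlo estimates of
# E_J[prod_l theta_{J_l} | i]. The degree-0 term is omitted here, as in
# the pseudocode above.

def sdccq_estimate(c_tilde, Q, theta, m, sample_index_vector, rng=random):
    """Stochastic estimate Z^m_k(theta), given cached Q and a sampler."""
    k = len(c_tilde) - 1
    z_hat = 0.0
    for i in range(1, k + 1):
        acc = 0.0
        for _ in range(m[i - 1]):
            j = sample_index_vector(i, rng)   # j in [d]^i, drawn from P_phi(.|i)
            prod = 1.0
            for jl in j:
                prod *= theta[jl]
            acc += prod
        z_hat += c_tilde[i] * Q[i] * acc / m[i - 1]
    return z_hat
```

Only θ enters this routine; the cached Q_φ(i) and the sampler stay fixed as long as the structure G and the state space X do not change, which is exactly what allows the expensive part to be shipped to a server.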
In Section 4 it is shown which m_i suffice to guarantee a small approximation error. However, we can already anticipate that m_i ≪ d^k. We would like to stress again that the Q_φ(i) will not change as long as the graphical structure and the state space X are fixed; moreover, the Q_φ(i) need not be computed on the same machine. It is reasonable to assume that the Q_φ(i) are computed on a large cluster while the inference is eventually carried out on a small device with low computational resources or energy constraints.

Figure 1. From left to right: (i) Absolute error of degree-16 Taylor, Remez and DCT approximations of exp(x) on the interval [−5, 5]; only Remez and DCT have oscillating error curves. (ii) Theoretical bounds for the absolute error |Z(θ) − Ẑ_k(θ)| of DCCQ (Alg. 1) as a function of the parameter norm for a 4×4 Ising grid model, predicted by Lemma 3. (iii) Absolute error |Z(θ) − Z^m_k(θ)| of SDCCQ (Alg. 2) as a function of the parameter norm for a 4×4 Ising grid model; m_i = 10^4 for all i. (iv) Scalability of SDCCQ (Alg. 2): runtime in seconds as a function of the number of CPU cores for different polynomial degrees. Best viewed in color.

4. Analysis

The most important prerequisite for our algorithms is χ-integrability of the sufficient statistic.

Lemma 2. The sufficient statistic of a discrete state space model, constructed from binary indicator functions (2), (3) and (4), is χ-integrable.

Proof. The state space is discrete and thus the definition of χ-integrability becomes

\chi_\phi(j) = \sum_{x \in X} \prod_{i=1}^{k} \phi(x)_{j_i}, \quad k \in \mathbb{N},\ j \in [d]^k.

Since φ(x)_{j_i} is binary, the product Π_{i=1}^{k} φ(x)_{j_i} is binary too. The summation therefore equals the number of instances x ∈ X for which the product evaluates to 1, for an arbitrary but fixed j. We will derive this number now. As described in Section 2, each index of the vector φ(x) corresponds to a particular assignment of a value y to a clique C. Since j may be any k-dimensional vector with elements from {1, 2, . . . , d}, it might happen that it contains incompatible indices. Two incompatible indices correspond to two different assignments to the same vertex, which is not possible. Since the summation is over all possible instances, the product can never evaluate to 1 for vectors j which contain incompatible indices, and hence χ_φ(j) = 0. Checking if a vector contains a pair of incompatible indices can be done with O(k²) comparisons. Any j that contains only compatible indices corresponds to a possible assignment y_{U(j)} to an induced subset U(j) ⊆ V of vertices. The product will evaluate to 1 for every instance x that matches this particular assignment, i.e., x_{U(j)} = y_{U(j)}. Fixing the values of the vertices in U(j) results in a remaining state space of size |X|/|X_{U(j)}|, which is exactly the number of times that the product will evaluate to 1. This finally implies that χ_φ(j) = |X|/|X_{U(j)}| if j contains no incompatible indices, and χ_φ(j) = 0 otherwise. This can of course be computed in polynomial time and φ is thus χ-integrable.

In the rest of this section, we present our main theorems which provide bounds on the error of Algorithms 1 and 2. Formally, the following error bound is known for Chebyshev approximations (Xiang et al., 2010).

Lemma 3. Suppose f is analytic in the region bounded by the ellipse E_ρ = {z ∈ C : |z + √(z² − 1)| = ρ} with major and minor semi-axis lengths summing to ρ > 1, foci at ±1, and max_{z∈E_ρ} |f(z)| ≤ M. Let g_k denote the degree-k interpolant of f according to Eq. (9). Then, for each k ≥ 0,

\max_{z \in [-1,1]} |f(z) - g_k(z)| \le \frac{4M}{(\rho - 1)\rho^k}.
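A quick numerical check of Lemma 3 for f = exp (the case needed below) helps to gauge how loose the bound is in practice. The sketch below is ours and uses Chebyshev zeros as interpolation nodes, a standard variant that differs slightly from the node choice in (8); M is the maximum of |exp(z)| on E_ρ, attained at the real semi-axis (ρ + 1/ρ)/2.

```python
import math

# Compare the observed Chebyshev approximation error of exp on [-1, 1]
# with the Lemma 3 bound 4M / ((rho - 1) rho^k) for rho = 1 + sqrt(2).

def chebyshev_interpolant(f, k):
    nodes = [math.cos(math.pi * (j + 0.5) / (k + 1)) for j in range(k + 1)]
    c = [(1.0 if i == 0 else 2.0) / (k + 1)
         * sum(f(x) * math.cos(i * math.acos(x)) for x in nodes)
         for i in range(k + 1)]
    return lambda x: sum(ci * math.cos(i * math.acos(x)) for i, ci in enumerate(c))

rho = 1.0 + math.sqrt(2.0)
M = math.exp((rho + 1.0 / rho) / 2.0)          # max of |exp| on E_rho
for k in (2, 4, 8):
    g = chebyshev_interpolant(math.exp, k)
    grid = [x / 1000.0 for x in range(-1000, 1001)]
    observed = max(abs(math.exp(x) - g(x)) for x in grid)
    bound = 4.0 * M / ((rho - 1.0) * rho ** k)
    print(k, observed, bound)
    assert observed <= bound
```

For exp, the observed error decays considerably faster than the bound, which is consistent with the empirical behavior reported in Section 5.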
To apply this Lemma in the sequel, we will consider the function f = exp ∘ γ with γ : [−1, 1] → [l, u] and γ(x) = (l + u)/2 + x(u − l)/2 (Bernstein, 1912; Mason & Handscomb, 2002). Any choice of ρ > 1 would suffice, but increasing ρ increases the imaginary axis of E_ρ and, at the same time, both numerator and denominator of the error bound. On the other hand, for ρ = 1, E_ρ collapses to the real line, which is not compatible with the Lemma. We set ρ = 1 + √2 as suggested in (Xiang et al., 2010).

Lemma 4. Mean field lower bound. Any mean parameter µ ∈ M with µ = Σ_x P(x) φ(x) yields a lower bound on the partition function: A(θ) ≥ ⟨θ, µ⟩ − A*(µ), where A* is the convex conjugate of A = ln Z (Wainwright & Jordan, 2008). It can be shown that A*(µ) is equal to the negative Shannon entropy, i.e., A*(µ) = −H(P) = Σ_x P(x) ln P(x).

Theorem 5. Deterministic approximation error. Let φ be χ-integrable, ε > 0, k ≥ (ln 8 + 2 ln M − ln(ε(ρ − 1)))(ln ρ)^{−1} and ‖φ(x)‖₂ ≤ B for all x. Then, the output Ẑ_k(θ) of Algorithm 1 satisfies

|Z(\theta) - \hat{Z}_k(\theta)| \le \frac{\varepsilon}{2}\, Z(\theta).

Proof. The idea is to show that the lower bound from Lemma 4 is also an upper bound for the right-hand side in Lemma 3. To see this, construct the naive mean field variational lower bound, based on a fully factored joint probability P(X = x) = Π_{v∈V} P_v(X_v) with uniform vertex marginals P_v(X_v) = U(X_v) (here we assume that the uniform distribution on X_v exists) for all v ∈ V. This implies for edge marginals that P_{vu}(X_v, X_u) = P_v(X_v) P_u(X_u). Plugging this into Lemma 4, we get

A(\theta) \ge \langle \theta, \mu \rangle + H(\mu) \ge -\|\theta\|_2 \|\mu\|_2 + H(\mu) \ge -\|\theta\|_2 B - \sum_{v \in V} \sum_{x_v \in X_v} \frac{1}{|X_v|} \ln \frac{1}{|X_v|} = -\|\theta\|_2 B + \sum_{v \in V} \ln |X_v|,

since ‖φ(x)‖₂ ≤ B implies ‖µ‖₂ ≤ B. Now, let

M = \max_{z \in E_\rho} |\exp(\gamma(z))| = \exp(\gamma(\sqrt{2})) = \exp\big((l+u)/2 + \sqrt{2}(u-l)/2\big) > \exp\big((l+u)/2 + (u-l)/2\big) = \exp(\|\theta\|_2 B) \ge \max_x \psi(x)

be an upper bound on the potential, where we set u = ‖θ‖₂ B as suggested in Section 3. By applying Lemma 3, it follows that

|Z(\theta) - \hat{Z}_k(\theta)| = \Big| \sum_{x} \big(\psi(x) - \hat{\psi}_k(x)\big) \Big| \le \sum_x |\psi(x) - \hat{\psi}_k(x)| \le |X|\, \frac{4M}{(\rho - 1)\rho^k}.

Combining this with the lower bound Z(θ) ≥ |X| exp(−‖θ‖₂ B) from above and the stated choice of k yields the claim.

Notice that the occurrence of ‖θ‖₂ is an artifact of the Cauchy-Schwarz inequality. One may conduct the same derivation based on Hölder's inequality, which in turn would replace the term ‖θ‖₂ B by ‖θ‖₁ B_∞.

Now, we analyze the error that we have to tolerate if we want to apply complexity-decoupling as explained in Section 3.1. Therein, additional error is introduced by approximating the expected value E_J[Π_{l=1}^{i} θ_{J_l} | i] with the sample mean Ê_m[Π_{l=1}^{i} θ_{J_l} | i] := (1/m) Σ_{r=1}^{m} Π_{l=1}^{i} θ_{j^{(r)}_l}, where each j^{(r)} is sampled from the distribution with pmf (11). Indeed, Ê_m is unbiased and converges to E_J. However, we would like to quantify the number of samples that is required to achieve a desired accuracy.

Theorem 6. Let X = X(J) = Π_{l=1}^{i} θ_{J_l}, where J is the random variable with pmf (11). Moreover, let ε_i > 0, ζ_0 ∈ (0, 1) and m ≥ −(ln ζ_0)/(2 ‖θ‖_∞^{−i} ε_i |ln |X| − ln M|), where M is an upper bound on the potential. Then

P\big[\, |\mathbb{E}_J[X \mid i] - \hat{\mathbb{E}}_m[X \mid i]| \ge \varepsilon_i |Z(\theta)| \,\big] \le \zeta_0.

Proof. Applying a Chernoff-like argument, followed by the triangle inequality and Lemma 4, we obtain

P\big[\, |\hat{\mathbb{E}}_m[X \mid i] - \mathbb{E}[X \mid i]| \ge \varepsilon_i |Z(\theta)| \,\big] \le \exp\big(-m \cdot 2\|\theta\|_\infty^{-i} \varepsilon_i |\ln|X| - \ln M|\big) \le \zeta_0,

where the last inequality follows from the stated lower bound on m.
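Before turning to the combined guarantee, a quick worked instance of the degree requirement of Theorem 5 may be helpful; the numbers below are only an illustration of the bound as stated above, for an arbitrarily chosen parameter norm. With the symmetric interval u = −l = ‖θ‖₂B = 2 (so that ln M = √2 ‖θ‖₂B ≈ 2.83), ε = 0.1 and ρ = 1 + √2, the bound asks for

k \ge \frac{\ln 8 + 2\ln M - \ln(\varepsilon(\rho - 1))}{\ln \rho} \approx \frac{2.08 + 5.66 + 1.96}{0.88} \approx 11,

i.e., a degree-11 interpolation already suffices for a relative error of ε/2 = 5% at this parameter norm; the required degree grows only linearly in ‖θ‖₂B and logarithmically in 1/ε.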
Theorem 7. Stochastic approximation error. Let the preconditions of Theorem 5 hold. Furthermore, let ζ ∈ (0, 1). Algorithm 2 with

m_i \ge -\frac{\ln(\zeta/k)}{2\,\|\theta\|_\infty^{-i}\,\frac{\varepsilon}{2k|w_i|}\,|\ln|X| - \ln M|}

delivers an (ε, ζ)-approximation of the partition function. In particular, P[ |Z(θ) − Z^m_k(θ)| ≥ ε Z(θ) ] ≤ ζ.

Proof. Let X = X(J) = Π_{l=1}^{i} θ_{J_l} and w_i = c̃_i Q_φ(i). We conclude that

P\big[\, |Z(\theta) - Z^m_k(\theta)| \ge \varepsilon Z(\theta) \,\big] \le P\Big[ \sum_{i=1}^{k} |w_i|\, \big|\mathbb{E}_J[X \mid i] - \hat{\mathbb{E}}_{m_i}[X \mid i]\big| \ge \frac{\varepsilon}{2} Z(\theta) \Big] \le \sum_{i=1}^{k} P\Big[ \big|\mathbb{E}_J[X \mid i] - \hat{\mathbb{E}}_{m_i}[X \mid i]\big| \ge \frac{\varepsilon}{2k|w_i|} Z(\theta) \Big]

from the triangle inequality and Theorem 5. Now, the claim follows from Theorem 6 with ε_i = ε/(2k|w_i|) and ζ_0 = ζ/k.

Notice that c̃ is required to compute m. But this is not restrictive, since c̃ can be computed easily with (8), followed by a conversion to native coefficients. Hence, it is safe to assume that the c̃ are available for computing m.

5. Numerical Evaluation

For our experiments, we implemented DCCQ (Alg. 1) and SDCCQ (Alg. 2) and executed them on a machine with 40 E5-2697 Xeon CPU cores. The numerical evaluation should show that SDCCQ achieves highly accurate results whenever the norm of the parameter vector is small. In addition, we investigate cases where the norm is not small and compare our method to the approaches which are listed in Table 1. Finally, we analyze the scalability of SDCCQ in terms of parallel computing.

Figure 2. Estimation errors for the log-partition function on 10×10 Ising grids with randomly generated parameter vectors for various coupling strengths ω. (i) attractive, κ = 0.1; (ii) attractive, κ = 1.0; (iii) mixed, κ = 0.1; (iv) mixed, κ = 1.0. Best viewed in color.

One of the main benefits of SDCCQ is the complexity-decoupling. This allows us to compute the Q_φ(i) (see Alg. 2) only once per structure and reuse them in every run. (Our C++ source code and the precomputed Q_φ(i) values are available at http://sfb876.tu-dortmund.de/sdccq.) In total, we computed the Q_φ(i) for Ising grid models with sizes 2×2, . . . , 128×128 and i ∈ {1, 2, . . . , 16}.

Parameters with small norm. First of all, we show that SDCCQ achieves small errors when the norm of θ is small (at most 1). Therefore, we constructed a 4×4 Ising grid with random Gaussian parameters, i.e., θ_i ∼ N(0, σ). We varied σ from 10^{−4} to 1 in order to generate models with various norms. The partition function was then estimated by SDCCQ for polynomial degrees in {2, 4, 6, 8, 10, 12} with m_i = 10^3 for all i, while the correct value of the partition function was computed by the JT algorithm. In Figure 1 (iii), the average absolute estimation error |Z(θ) − Z^m_k(θ)| (y-axis) is plotted against the norm of θ (x-axis). For comparison, the theoretical error curves which are implied by Lemma 3 are depicted in Figure 1 (ii). Empirically, the error of the stochastic low-degree approximations is lower than predicted, while the error of the high-degree approximations is higher than predicted. Note, however, that the error bound of SDCCQ (Theorem 7) is stochastic, i.e., an ε-approximation is achieved only with a certain probability. Further experiments show that the error can indeed be decreased by increasing the m_i. In this setting, an absolute error of 12000 corresponds to a relative error of ε < 0.2.
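The theoretical curves of Figure 1 (ii) can be traced with a few lines; the following sketch evaluates the worst-case bound |X| · 4M/((ρ − 1)ρ^k) implied by Lemma 3 for a 4×4 Ising grid. The clique count used below (16 vertices plus 24 edges) and the symmetric interval l = −u are assumptions made only for this illustration.

```python
import math

# Lemma 3 based worst-case bound on |Z(theta) - Z_k(theta)| for a 4x4
# Ising grid, as a function of the parameter norm and the degree k.

rho = 1.0 + math.sqrt(2.0)
n_states = 2 ** 16                       # |X| for 16 binary vertices
B = math.sqrt(16 + 24)                   # ||phi(x)||_2 = sqrt(|C(G)|)

def error_bound(theta_norm, k):
    u = theta_norm * B                   # u = ||theta||_2 * B, l = -u
    M = math.exp(math.sqrt(2.0) * u)     # max of exp(gamma(z)) on E_rho
    return n_states * 4.0 * M / ((rho - 1.0) * rho ** k)

for k in (2, 4, 8):
    print(k, [round(error_bound(t, k), 3) for t in (0.01, 0.1, 0.5, 1.0)])
```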
Parameters with large norm. In the second experiment, we investigate the quality of our new method when the norm of the parameter vector is well above 1 and hence Theorem 7 predicts a high error for small polynomial degrees. Nevertheless, the question is how SDCCQ compares to other approaches. We use the same experimental setup as in (Ermon et al., 2013). Specifically, we have n = 10 × 10 binary variables x ∈ {−1, 1}^n with weights θ_{vu=xy} = w_{vu} whenever x = y and θ_{vu=xy} = −w_{vu} otherwise. In the attractive setting, the w_{vu} are drawn from [0, ω]; in the mixed setting, from [−ω, ω]. Moreover, vertex weights θ_{v=1} = −θ_{v=−1} = w_v are sampled from [−κ, κ] with κ ∈ {0.1, 1.0}. This setting implies that ‖θ‖₂ is in the range [3.86, 44.53], which is far outside of the interval in which we should expect small errors. Figure 2 shows the estimation error for the log-partition function, i.e., ln Z^m_k(θ) − ln Z(θ), averaged over 5 folds, for SDCCQ, WISH, MF, TRW and LBP. SDCCQ has been run with m_i = 10^3 for all i and k ∈ {1, 2, 4, 8}, where, due to space restrictions, the plot shows the average over all SDCCQ runs. Individual runs are within a standard deviation of 5. Clearly, the error increases for increasing coupling strengths. However, SDCCQ is often close to the MF lower bound or the TRW upper bound. WISH delivers the most accurate result but has by far the largest complexity per run.

Scalability. Finally, we analyze the empirical runtime on randomly parametrized Ising grids with 128×128 vertices. Since the major computational costs in Algorithms 1 and 2 arise through a large summation and the sampling step, respectively, both are easy to parallelize. We restricted the execution to a subset of the available CPU cores and measured the runtime in seconds. The result is shown in Figure 1 (iv): SDCCQ scales well with an increasing number of CPU cores.

Conclusion. For the first time, we exploited quadrature rules to construct an approximation to the partition function of probabilistic models with arbitrary structure. We derived bounds on the error and compared the new algorithm to other methods. It turned out that the new method scales well on multi-core systems and that it delivers superior results whenever the norm of the model's parameter vector is small. Moreover, complexity-decoupling allows us to run probabilistic inference with error guarantees on small, resource-constrained devices.

Acknowledgments

This work has been supported by the Deutsche Forschungsgemeinschaft (DFG) within the collaborative research center SFB 876, project A1.

References

Bernstein, Sergei Natanovich. Sur la meilleure approximation des fonctions continues par les polynômes du degré donné. I. Communications de la Société mathématique de Kharkow, 2e série, 13(2-3):49–144, 1912.

Bethe, Hans A. Statistical theory of superlattices. Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences, 150(871):552–575, 1935.

Blei, David M., Ng, Andrew Y., and Jordan, Michael I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

Bulatov, Andrei and Grohe, Martin. The complexity of partition functions. In Díaz, Josep, Karhumäki, Juhani, Lepistö, Arto, and Sannella, Donald (eds.), Automata, Languages and Programming, volume 3142 of Lecture Notes in Computer Science, pp. 294–306. Springer, Heidelberg, Germany, 2004.

Clenshaw, C. W. and Curtis, A. R. A method for numerical integration on an automatic computer. Numerische Mathematik, 2(1):197–205, 1960.

Clifford, Peter. Markov random fields in statistics. In Disorder in Physical Systems, Oxford Science Publications, pp. 19–32. Oxford University Press, New York, 1990.

Cover, Thomas M. and Thomas, Joy A. Elements of Information Theory. John Wiley & Sons, New York, NY, USA, 2nd edition, 2006. ISBN 978-0471241959.

Ermon, Stefano, Gomes, Carla P., Sabharwal, Ashish, and Selman, Bart. Taming the curse of dimensionality: Discrete integration by hashing and optimization. In 30th International Conference on Machine Learning, pp. 334–342, 2013.
Fraser, W. A survey of methods of computing minimax and near-minimax polynomial approximations for functions of a single independent variable. Journal of the ACM, 12(3):295–314, July 1965.

Frey, Brendan J. Local probability propagation for factor analysis. In Solla, S. A., Leen, T. K., and Müller, K. (eds.), Advances in Neural Information Processing Systems, volume 12, pp. 442–448. MIT Press, Cambridge, MA, USA, 2000.

Gautschi, W. Questions of numerical condition related to polynomials. Studies in Numerical Analysis, (24):140–177, 1985.

Girolami, Mark and Calderhead, Ben. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.

Goldberg, Leslie Ann and Jerrum, Mark. A polynomial-time algorithm for estimating the partition function of the ferromagnetic Ising model on a regular matroid. SIAM Journal on Computing, 42(3):1132–1157, 2013.

Grünwald, Peter D. The Minimum Description Length Principle. The MIT Press, Cambridge, MA, USA, 2007. ISBN 978-0262072816.

Han, Insu, Malioutov, Dmitry, and Shin, Jinwoo. Large-scale log-determinant computation through stochastic Chebyshev expansions. In 32nd International Conference on Machine Learning, pp. 908–917, 2015.

Heskes, Tom. Stable fixed points of loopy belief propagation are local minima of the Bethe free energy. In Becker, S., Thrun, S., and Obermayer, K. (eds.), Advances in Neural Information Processing Systems, volume 15, pp. 343–350, 2002.

Jaakkola, Tommi S. and Jordan, Michael I. Computing upper and lower bounds on likelihoods in intractable networks. In 12th Annual Conference on Uncertainty in Artificial Intelligence, pp. 340–348. Morgan Kaufmann Publishers, 1996.

Koller, Daphne and Friedman, Nir. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009. ISBN 978-0262013192.

Kschischang, Frank R., Frey, Brendan J., and Loeliger, Hans-Andrea. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, 2001.

Lauritzen, Steffen L. Graphical Models. Oxford University Press, Oxford, UK, 1996. ISBN 978-0198522195.

Lauritzen, Steffen L. and Spiegelhalter, David J. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society. Series B (Methodological), 50(2):157–224, 1988.

Liu, Qiang, Peng, Jian, Ihler, Alexander, and Fisher III, John. Estimating the partition function by discriminance sampling. In 31st Annual Conference on Uncertainty in Artificial Intelligence, pp. 514–522. AUAI Press, 2015.

Mason, J. C. and Handscomb, David C. Chebyshev Polynomials. Chapman and Hall/CRC, 1st edition, 2002. ISBN 978-0849303555.

Pearl, Judea. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, Burlington, MA, USA, 1988. ISBN 978-1558604797.

Pitman, Edwin James George. Sufficient statistics and intrinsic accuracy. Mathematical Proceedings of the Cambridge Philosophical Society, 32:567–579, 1936.

Ranganath, Rajesh, Tang, Linpeng, Charlin, Laurent, and Blei, David M. Deep exponential families. In Lebanon, Guy and Vishwanathan, S. V. N. (eds.), 18th International Conference on Artificial Intelligence and Statistics, volume 38 of JMLR W&CP. JMLR.org, 2015.

Ruozzi, Nicholas. The Bethe partition function of log-supermodular graphical models.
In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, Kilian Q. (eds.), Advances in Neural Information Processing Systems, volume 25, pp. 117–125. Curran Associates, Inc., 2012.

Schraudolph, Nicol N. and Kamenetsky, Dmitry. Efficient exact inference in planar Ising models. In Koller, Daphne, Schuurmans, Dale, Bengio, Yoshua, and Bottou, Léon (eds.), Advances in Neural Information Processing Systems, volume 21, pp. 1417–1424, 2008.

Sutton, Charles and McCallum, Andrew. An introduction to conditional random fields. Foundations and Trends in Machine Learning, 4(4):267–373, 2011.

Valiant, Leslie G. The complexity of enumeration and reliability problems. SIAM Journal on Computing, 8(3):410–421, 1979.

Wainwright, Martin J. and Jordan, Michael I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.

Wainwright, Martin J., Jaakkola, Tommi S., and Willsky, Alan S. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7):2313–2335, 2005.

Weiss, Yair. Comparing the mean field method and belief propagation for approximate inference in MRFs. In Opper, M. and Saad, D. (eds.), Advanced Mean Field Methods: Theory and Practice, pp. 229–239. MIT Press, Cambridge, MA, USA, 2001.

Weller, Adrian and Jebara, Tony. Clamping variables and approximate inference. In Ghahramani, Zoubin, Welling, Max, Cortes, Corinna, Lawrence, Neil D., and Weinberger, Kilian Q. (eds.), Advances in Neural Information Processing Systems, volume 27, pp. 909–917, 2014.

Xiang, Shuhuang, Chen, Xiaojun, and Wang, Haiyong. Error bounds for approximation in Chebyshev points. Numerische Mathematik, 116(3):463–491, 2010.

Yedidia, Jonathan S., Freeman, William T., and Weiss, Yair. Understanding belief propagation and its generalizations. In Exploring Artificial Intelligence in the New Millennium, pp. 239–269. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003.