Stochastic Discrete Clenshaw-Curtis Quadrature

Nico Piatkowski NICO.PIATKOWSKI@TU-DORTMUND.DE
Artificial Intelligence Group, TU Dortmund, Germany

Katharina Morik KATHARINA.MORIK@TU-DORTMUND.DE
Artificial Intelligence Group, TU Dortmund, Germany

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

Abstract

The partition function is fundamental for probabilistic graphical models; it is required for inference, parameter estimation, and model selection. Evaluating this function corresponds to discrete integration, namely a weighted sum over an exponentially large set. This task quickly becomes intractable as the dimensionality of the problem increases. We propose an approximation scheme that, for any discrete graphical model whose parameter vector has bounded norm, estimates the partition function with arbitrarily small error. Our algorithm relies on a near minimax optimal polynomial approximation to the potential function and a Clenshaw-Curtis style quadrature. Furthermore, we show that this algorithm can be randomized to split the computation into a high-complexity part and a low-complexity part, where the latter may be carried out on small computational devices. Experiments confirm that the new randomized algorithm is highly accurate if the parameter norm is small, and is otherwise comparable to methods with unbounded error.

1. Introduction

Graphical models serve as the underlying framework of various machine learning techniques and facilitate important real-world applications in computer vision, computational biology, signal processing and natural language processing, to mention just a few (Pearl, 1988; Lauritzen, 1996; Koller & Friedman, 2009). When using these models, a central problem is that of computing the partition function, since it is required for computing probabilities from the model and it plays a fundamental role in marginal inference (Jaakkola & Jordan, 1996), maximum-likelihood and maximum-a-posteriori parameter estimation (Wainwright & Jordan, 2008), and model selection (Cover & Thomas, 2006; Grünwald, 2007).

Unfortunately, exact computation of the partition function is an intractable (#P-complete) problem for many graphical models of interest, such as those involving a non-tree conditional independence structure (short: structure) or high-order cliques/factors. Moreover, the chances of finding polynomial-time algorithms for these tasks are rather low, unless P = NP. For some special cases, polynomial-time algorithms are known (Schraudolph & Kamenetsky, 2008; Goldberg & Jerrum, 2013). However, significant research efforts have been conducted in order to derive approximations for more general cases. There are two well established general approaches to approximate large discrete sums: variational methods and sampling. Methods which provide reasonably tight error bounds that hold for any structure are either not available or require solving a series of hard combinatorial optimization problems (Ermon et al., 2013). Based on work from statistical physics, variational methods (Wainwright & Jordan, 2008) are often very fast but do not provide quality guarantees. Loopy belief propagation (LBP) (Pearl, 1988; Kschischang et al., 2001) is well known for computing locally consistent solutions. Nevertheless, this procedure might not converge, and if it does, the result is a local minimum of a surrogate objective (the Bethe free energy).
Tree-reweighted approaches (Wainwright & Jordan, 2008) deliver the tightest upper bound, but the exact error cannot be quantified. In addition, the approximation error depends on the structure of the corresponding probability density. In contrast, the worst-case error of our method is bounded and independent of the structure. Sampling-based techniques are popular, but they suffer from similar issues because the number of samples required to obtain a statistically reliable estimate for multi-dimensional random variables grows exponentially fast. Markov chain Monte Carlo (MCMC) methods are asymptotically accurate, but guarantees exist only for certain special cases. Their performance depends crucially on the choice of the proposal distribution, which often must be domain-specific and expert-designed (Girolami & Calderhead, 2011). Importance sampling methods (Liu et al., 2015) try to correct for biased proposal distributions, but their error is hard to quantify as well.

In contrast to existing approaches, our method is based on a numerical approximation technique called quadrature. It allows us to derive a scalable approximation procedure for partition functions that is applicable to any structure and delivers guarantees on the approximation error. Surprisingly, quadrature-based approximations do not appear in the machine learning literature. The technically most related article is a recent method to compute the log-determinant of large matrices (Han et al., 2015), based on Chebyshev polynomials. Our main contributions can be summarized as follows.

We present a new deterministic algorithm called Discrete Clenshaw-Curtis Quadrature (DCCQ). It gets a structure G, a parameter vector θ ∈ R^d and a polynomial degree k as inputs, and outputs Ẑ_k(θ) such that |Z(θ) − Ẑ_k(θ)| ≤ (ε/2) Z(θ) in time O(k² n^{2k}) whenever k ∈ ω(‖θ‖₂ − ln ε).

Moreover, we show how to split the computation of our approximation into an expensive part, which depends on the structure G and the state space X, and a cheap part, which depends on the model parameters θ. If the target platform has limited computational resources (like a mobile phone or a sensor), the expensive computation may be carried out on a server. Since the expensive part is independent of the model parameters θ, it can be re-used for any repeated computation of the partition function, a setting that is typical in iterative parameter estimation procedures. To this end, we present a randomized algorithm called Stochastic Discrete Clenshaw-Curtis Quadrature (SDCCQ). It gets G, θ, k and m as inputs and outputs Z^m_k(θ) such that P[|Z(θ) − Z^m_k(θ)| ≥ ε Z(θ)] ≤ ζ in time O(k² n^{2k}) + O(k² m_max), whenever k ∈ ω(‖θ‖₂ − ln ε) and the m_i are large enough.

Numerical experiments on grid structures show that the new SDCCQ delivers highly accurate approximations while not suffering from high runtimes, convergence issues or problems with mixed-type potentials, whenever the norm of the parameter vector is small.

As opposed to existing methods, our new approach (i) allows for a user-specified trade-off between runtime and approximation quality via a single parameter, (ii) can output a bound on its own worst-case error, (iii) has no convergence issues, and (iv) allows us to split the computational complexity and re-use existing results.

2. Notation and background

In this section, the preliminaries of our new approximation to the partition function are explained.
We provide some background on graphical models, followed by a short recap of quadrature rules and polynomial approximation.

2.1. Graphical Models

Consider a multi-variate random variable X where each X_i with 1 ≤ i ≤ n takes values x_i from the space X_i. The concatenation of all n variables yields the random variable X with product state space X = X_1 × X_2 × ⋯ × X_n. We denote assignments to single vertices v ∈ V by x_v = y with y ∈ X_v. For any subset of variables U ⊆ V, X_U = ⨂_{v∈U} X_v is its joint state space. Moreover, for any {v, u, w} ⊆ V, we define vuw := {v, u, w} to simplify notation.

The formalism of exponential families provides a unifying framework for a large variety of probability distributions. In particular, they cover all types of probabilistic graphical models (Lauritzen, 1996; Koller & Friedman, 2009), including Bayesian networks (Pearl, 1988), Markov random fields (Wainwright & Jordan, 2008), conditional random fields (Sutton & McCallum, 2011), logistic regression, latent Dirichlet allocation (Blei et al., 2003) and recent deep models (Ranganath et al., 2015). Probability densities which are members of an exponential family can be written as

P_\theta(x) = \exp(\langle \theta, \phi(x) \rangle - A(\theta)) = \frac{1}{Z(\theta)}\,\psi(x).   (1)

Besides its parameter vector θ ∈ R^d, P_θ consists of two major ingredients, namely the potential ψ(x) = exp(⟨θ, φ(x)⟩) and the partition function Z(θ) = exp(A(θ)).

Let us first have a closer look at the potential. It assigns a positive real number to every possible instance x by mapping it from X into a real vector space via a sufficient statistic (or feature map) φ : X → R^d, where φ encodes the structure of X. We ease the notation by identifying components of X with vertices of a graph G = (V, E). In discrete state space models, the state of variables is encoded by indicator functions:

\phi_{v=y}(x) = 1_{\{x_v = y\}}   (2)
\phi_{vu=yz}(x) = 1_{\{x_v = y\}} 1_{\{x_u = z\}}   (3)
\phi_{U=y}(x) = \prod_{v \in U} 1_{\{x_v = y_v\}}, \quad U \subseteq V   (4)

To exactly represent the probability mass of a multi-variate random variable X, φ(x) has to contain at least one indicator per possible clique assignment for all maximal cliques of a structure G (Clifford, 1990). (Cliques are fully connected subgraphs; a clique is maximal if it is not contained in any other clique.) If C(G) is the set of maximal cliques of G, then φ(x) = (φ_{U=u}(x) : U ∈ C(G), u ∈ X_U).

Sufficiency of φ is declared with respect to θ, i.e., knowledge about x is not required to infer θ once φ(x) is known. This property is particularly useful when exponential families have to be applied in resource-constrained, autonomous, medical or self-quantification devices, because arbitrarily large data sets may be aggregated into a finite dimensional representation. In fact, only members of exponential families have this property (Pitman, 1936). We now identify another property of sufficient statistics which will become crucial for finding an efficient inference procedure:

Definition 1. The sufficient statistic φ : X → R^d is called χ-integrable if χ_φ : [d]^k → R with

\chi_\phi(j) = \int_{X} \prod_{i=1}^{k} \phi(x)_{j_i} \, d\nu(x), \quad k \in \mathbb{N},\ j \in [d]^k,

admits a closed-form expression that is computable in polynomial time. Here, [d]^k is an abbreviation for {1, 2, . . . , d}^k.

In fact, if φ is composed of indicator functions like (2), (3) or (4), it is χ-integrable. We will prove this statement in Section 4. Note, however, that our method is not restricted to binary sufficient statistics or discrete state space models; any χ-integrable φ suffices.

The partition function

Z(\theta) = \int_{X} \psi(x) \, d\nu(x)   (5)

accumulates the potential of every possible instance and ensures normalization of P_θ. It is defined w.r.t. a reference measure ν; for continuous (discrete) state spaces, ν is the Lebesgue (counting) measure.
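For a concrete picture of (1)-(5) in the discrete case, the following minimal sketch (not the authors' implementation; the tiny chain graph, state spaces and helper names are chosen only for illustration) builds the indicator sufficient statistics (2)-(3) for a three-vertex chain and evaluates Z(θ) by brute force over the full state space X.

```python
import itertools
import math

# Minimal sketch: a tiny pairwise model with overcomplete indicator
# sufficient statistics, and the partition function (5) evaluated by
# brute force over the full state space X.

# Chain graph v0 - v1 - v2, each vertex binary: X_v = {0, 1}.
vertices = [0, 1, 2]
edges = [(0, 1), (1, 2)]
states = [0, 1]

# Index the sufficient statistic: one indicator per vertex state (Eq. 2)
# and one per edge state pair (Eq. 3); d is the parameter dimension.
feat_index = {}
for v in vertices:
    for y in states:
        feat_index[("v", v, y)] = len(feat_index)
for (v, u) in edges:
    for y in states:
        for z in states:
            feat_index[("e", v, u, y, z)] = len(feat_index)
d = len(feat_index)

def phi(x):
    """Overcomplete binary sufficient statistic phi(x) in R^d."""
    f = [0.0] * d
    for v in vertices:
        f[feat_index[("v", v, x[v])]] = 1.0
    for (v, u) in edges:
        f[feat_index[("e", v, u, x[v], x[u])]] = 1.0
    return f

def partition_function(theta):
    """Brute-force evaluation of Z(theta) = sum_x exp(<theta, phi(x)>)."""
    total = 0.0
    for x in itertools.product(states, repeat=len(vertices)):
        total += math.exp(sum(t * f for t, f in zip(theta, phi(x))))
    return total

theta = [0.1] * d
print(d, partition_function(theta))
```

Already for this toy model the sum runs over |X| = 8 joint states; the exponential growth of this set is exactly what the quadrature-based approach below avoids.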
If the graph G contains no loops, i.e., is tree-structured, Z(θ) can be evaluated exactly in polynomial time. The same holds for planar Ising models (Schraudolph & Kamenetsky, 2008). If no special property of the underlying structure can be exploited, evaluating Z(θ) is hard, in fact #P-complete (Valiant, 1979; Bulatov & Grohe, 2004). The junction tree (JT) algorithm (Lauritzen & Spiegelhalter, 1988; Wainwright & Jordan, 2008) can be applied to compute the exact value Z(θ) with runtime exponential in the size of the largest clique of the triangulation of the graph G. Nevertheless, several approximate methods arose in the last decades. The Bethe approximation (Bethe, 1935; Wainwright & Jordan, 2008), whose main underlying idea is to treat a general graph like a tree, is maybe the most popular; it is computed by loopy belief propagation (Pearl, 1988; Frey, 2000; Kschischang et al., 2001). Its runtime depends on the number of message passing iterations until convergence. This number is in general unknown and the algorithm might not even converge. If no further assumptions are made, the Bethe approximation delivers an estimate of unknown quality. It has been shown quite recently that in the case of log-supermodular potential functions (Ruozzi, 2012; Weller & Jebara, 2014), the Bethe approximation is a lower bound on the partition function. Also the naive mean field (MF) technique (Weiss, 2001; Wainwright & Jordan, 2008) is known to provide a lower bound on Z(θ) (Weiss, 2001). The best known general upper bound is based on convex combinations of exponential family parameters θ(T) which correspond to spanning trees T ∈ T(G) of the original graph G, known as tree-reweighted belief propagation (TRW) (Wainwright et al., 2005). An overview is provided in Table 1.

Table 1. State-of-the-art methods to compute/approximate the partition function of an undirected model with graph G = (V, E), n = |V|, m = |E|. MCMC-based methods are omitted. Here, I is the number of iterations until convergence. L = max_{v∈V} |X_v| and Δ = max_{v∈V} |N_v| are the largest vertex domain and neighborhood size, respectively. w is the tree-width of G and d is the dimension of the parameter space. k_ε is the polynomial degree (implied by ε) as specified by Theorem 5. m_ζ = max_i m_i is the number of samples (implied by ζ) from the distribution (11) as specified by Theorem 7.

Algorithm | Complexity | Quality
JT (Lauritzen & Spiegelhalter, 1988) | O(L^w) | Exact
MF (Weiss, 2001) | O(I n L^Δ) | Lower bound
LBP (Heskes, 2002; Yedidia et al., 2003) | O(I m L^{2Δ}) | Local minimum of Bethe free energy
TRW (Wainwright et al., 2005) | O(I m L^{2Δ} + m log n) | Upper bound
WISH (Ermon et al., 2013) | O(n ln(n/ζ)) · Time(MAP) | (16, ζ)-approximation
DCCQ (Alg. 1) | O(k_ε² d^{2k_ε}) | ε-approximation (Theorem 5)
SDCCQ (Alg. 2) | O(k_ε² d^{2k_ε}) + O(k_ε² m_ζ) | (ε, ζ)-approximation (Theorem 7)

2.2. Quadrature

All of the existing approaches mentioned above are based on an approximation of the structure. In contrast, we propose a numerical approximation of the integration, based on general quadrature. If integrating a function is not tractable, one has to resort to numerical methods in order to approximate the definite integral. The basic idea of a quadrature rule is to replace the integrand f by an approximation h ≈ f that admits tractable integration. It turns out that choosing h = h_k to be a degree-k Chebyshev polynomial approximation of f has outstanding properties like rapidly decreasing and individually converging coefficients (Gautschi, 1985). The general quadrature procedure can be summarized as

\int_l^u f(x)\,dx \approx \int_l^u h_k(x)\,dx = \sum_{i=0}^{k} w_i f(x_i)   (6)

where x ∈ R, the w_i are certain coefficients and the x_i are certain abscissae in [l, u] (all to be determined) (Mason & Handscomb, 2002). Depending on the choice of interpolation points and different kinds of orthogonality properties, Chebyshev polynomial based quadrature rules are termed Gauss-Chebyshev quadrature, Fejér quadrature or Clenshaw-Curtis quadrature (Clenshaw & Curtis, 1960). We make use of discrete orthogonality properties and initialize our approximation at the zeros of second kind Chebyshev polynomials, which yields a Clenshaw-Curtis quadrature.

Chebyshev Polynomials. In order to construct a quadrature rule that is numerically well behaved, the Chebyshev polynomials T_k(x) are chosen as a basis. The fundamental recurrence relation is a convenient representation:

T_0(x) = 1, \quad T_1(x) = x, \quad T_k(x) = 2x\,T_{k-1}(x) - T_{k-2}(x).   (7)
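A minimal sketch of the recurrence (7), checked against the closed form T_k(x) = cos(k arccos x) on [−1, 1]; the function name is ours and chosen only for illustration.

```python
import math

# Evaluate the Chebyshev polynomials T_k via the three-term recurrence (7)
# and compare against the closed form T_k(x) = cos(k * arccos(x)) on [-1, 1].

def chebyshev_T(k, x):
    """Evaluate T_k(x) using T_0 = 1, T_1 = x, T_k = 2x T_{k-1} - T_{k-2}."""
    if k == 0:
        return 1.0
    t_prev, t_curr = 1.0, x
    for _ in range(2, k + 1):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return t_curr

for k in range(6):
    for x in (-0.9, -0.3, 0.0, 0.5, 1.0):
        assert abs(chebyshev_T(k, x) - math.cos(k * math.acos(x))) < 1e-12
print("recurrence matches cos(k arccos x)")
```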
It can be easily verified that each T_k is a polynomial of degree k. The degree-k Chebyshev interpolation h_k of a continuous function f is

h_k(x) = \sum_{i=0}^{k} c_i T_i(x) \quad \text{with} \quad c_i = \frac{2}{k+1} \sum_{j=1}^{k+1} f(x_j)\,T_i(x_j).   (8)

The interpolation points x_j = cos(jπ/(k+1)) are given by the extrema of the corresponding Chebyshev polynomial, which are, at the same time, the zeros of the Chebyshev polynomials of the second kind. The major distinguishing properties of the T_k are (i) discrete orthogonality, which is required for the derivation of the coefficients (8), and (ii) the fact that 2^{1−k} T_k is the minimax approximation to the zero function on [−1, 1] (Mason & Handscomb, 2002). Let h* ∈ H be an approximation to f on the domain [l, u]; h* is called the minimax approximation to f iff

\forall h \in H : \; \|f - h^*\|_\infty = \sup_{x \in [l,u]} |f(x) - h^*(x)| \le \|f - h\|_\infty.

Chebyshev showed that an outstanding property of any minimax approximation is an oscillating error curve. The coefficients given by (8) result in an approximation to the theoretically optimal (minimax) approximation (Mason & Handscomb, 2002) and allow for a rather fast computation in time O(k log k) via the discrete cosine transformation (DCT). Coefficients which deliver an approximation that is even closer to the optimal one can be obtained by the Remez exchange algorithm (Fraser, 1965). Error functions for DCT and Remez coefficients are shown in Figure 1 (i). Since both approaches deliver high quality approximations, we use DCT because of its superior efficiency. Finally, the coefficients c, which are coefficients of Chebyshev polynomials, can be converted to native coefficients c̃, i.e., coefficients of powers of x, by summing the corresponding coefficients that appear in the Chebyshev polynomials, weighted by c_i. Every Chebyshev interpolation can thus be equivalently expressed as

h_k(x) = \sum_{i=0}^{k} \tilde{c}_i x^i.   (9)

Although general quadrature has long been known, we exploit it here, for the first time, for approximating the partition function.
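To make the coefficient pipeline concrete, here is a minimal sketch of the two steps just described: fitting Chebyshev coefficients to f = exp via discrete orthogonality, and converting them to native (monomial) coefficients as in (9). The node choice below (Chebyshev zeros) is one standard variant and is an assumption of this sketch; the rule (8) above uses the second-kind zeros instead, and the interval is fixed to [−1, 1] for brevity (in general one first maps [l, u] to [−1, 1] affinely).

```python
import math

# Sketch of Eqs. (8)-(9): fit a degree-k Chebyshev interpolant to f = exp
# on [-1, 1], then expand it into monomial ("native") coefficients.

def chebyshev_coefficients(f, k):
    """Coefficients c_0..c_k of the interpolant sum_i c_i T_i(x)."""
    nodes = [math.cos(math.pi * (j + 0.5) / (k + 1)) for j in range(k + 1)]
    c = []
    for i in range(k + 1):
        s = sum(f(x) * math.cos(i * math.acos(x)) for x in nodes)
        c.append((1.0 if i == 0 else 2.0) * s / (k + 1))
    return c, nodes

def to_native(c):
    """Convert Chebyshev coefficients c into monomial coefficients (Eq. 9)."""
    k = len(c) - 1
    # T[i] holds the monomial coefficients of T_i, built via recurrence (7).
    T = [[0.0] * (k + 1) for _ in range(k + 1)]
    T[0][0] = 1.0
    if k >= 1:
        T[1][1] = 1.0
    for i in range(2, k + 1):
        for p in range(i):           # 2x * T_{i-1}
            T[i][p + 1] += 2.0 * T[i - 1][p]
        for p in range(i - 1):       # - T_{i-2}
            T[i][p] -= T[i - 2][p]
    return [sum(c[i] * T[i][p] for i in range(k + 1)) for p in range(k + 1)]

k = 8
c, nodes = chebyshev_coefficients(math.exp, k)
native = to_native(c)
poly = lambda x: sum(a * x ** p for p, a in enumerate(native))
print(max(abs(poly(x) - math.exp(x)) for x in nodes))  # ~ 0 at the nodes
```

The monomial form is what makes the next step work: once h_k is written as a polynomial in r = ⟨θ, φ(x)⟩, the sum over X factorizes into terms that only involve χ_φ.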
3. Algorithm

We start with the intuition behind our algorithm for approximating Z(θ), called Discrete Clenshaw-Curtis Quadrature. Indeed, we aim at approximating (5) by a quadrature rule (6). In contrast to (6), the integration in (5) has to be carried out over the n-dimensional set X. Note, however, that evaluating the potential at x is equivalent to computing exp(r) for r = ⟨θ, φ(x)⟩. Integrating ψ over X is thus equivalent to integrating exp over W = {r ∈ R : r = ⟨θ, φ(x)⟩, x ∈ X}. But without any further assumption on W, the definite integral ∫_W exp(z) dz continues to have no closed-form expression. We hence take h_k to be the Chebyshev interpolation of exp on [l, u] with l = min W and u = max W. Moreover, h_k may be interpreted as an approximation ψ̂_k(x) ≈ ψ(x) over X. Computing max W is NP-hard, since it is equivalent to computing the maximum-a-posteriori assignment of the underlying graphical model. We may use an upper bound on max W instead: remember that C(G) is the set of (maximal) cliques of G. For binary sufficient statistics (2), (3) and (4), it holds for all x that ‖φ(x)‖₂ = √|C(G)|, since each clique is in exactly one state. Therefore, max W ≤ ‖θ‖₂ B, which follows directly from the Cauchy-Schwarz inequality with B = √|C(G)|.

Now, we can approximate ψ on [l, u] by Chebyshev polynomials, and hence:

Z(\theta) = \int_X \psi(x)\,d\nu(x) \approx \int_X \hat{\psi}_k(x)\,d\nu(x) = \int_X h_k(\langle\theta,\phi(x)\rangle)\,d\nu(x) = \sum_{i=0}^{k} \tilde{c}_i \int_X \langle\theta,\phi(x)\rangle^i\,d\nu(x) = \sum_{i=0}^{k} \tilde{c}_i \sum_{j \in [d]^i} \chi_\phi(j) \prod_{l=1}^{i} \theta_{j_l} =: \hat{Z}_k(\theta).   (10)

Evaluating the last line has an asymptotic runtime of O(k² n^{2k}) · Time(χ_φ), which is polynomial in the size of the graph. The procedure is summarized in Algorithm 1. We will investigate in Section 4 which k is required to achieve a reasonable accuracy.

Algorithm 1 DCCQ
Input: θ ∈ R^d, k ∈ N, φ
Output: Approximate partition function Ẑ_k(θ)
1: [l, u] ← interval(θ)
2: c̃ ← coefficients(k, [l, u])   // Eq. (8)
3: Ẑ_k(θ) ← Σ_{i=0}^{k} c̃_i Σ_{j∈[d]^i} χ_φ(j) Π_{l=1}^{i} θ_{j_l}

3.1. Complexity-decoupling via Randomization.

O(k² n^{2k}) is lower than the time that it takes to enumerate the full state space X, but the partition function has to be re-computed every time the model changes. This happens frequently during iterative parameter learning procedures. Moreover, it is well known that the complexity of inference is mainly influenced by the structure (Lauritzen & Spiegelhalter, 1988; Wainwright & Jordan, 2008). So when this structure does not change, is it necessary to redo this costly computation during learning? It turns out that if we combine our Algorithm 1 with a probabilistic sampling scheme, it is possible to decouple the most costly computation from the rest. Surprisingly, the costly part depends only on the structure, while the cheap part combines it with the parameters. To see this, notice that the function χ_φ : [d]^k → R is non-negative, hence

P_\phi(j \mid k) = Q_\phi(k)^{-1} \chi_\phi(j) \quad \text{with} \quad Q_\phi(k) = \sum_{j' \in [d]^k} \chi_\phi(j')   (11)

defines a proper probability mass function (pmf) on [d]^k. Ẑ_k(θ), as given by (10), may now be rewritten as follows:

\hat{Z}_k(\theta) = \sum_{i=0}^{k} \tilde{c}_i \sum_{j \in [d]^i} \chi_\phi(j) \prod_{l=1}^{i} \theta_{j_l} = \sum_{i=0}^{k} \tilde{c}_i\, Q_\phi(i) \sum_{j \in [d]^i} P_\phi(j \mid i) \prod_{l=1}^{i} \theta_{j_l} = \sum_{i=0}^{k} w_i\, \mathbb{E}_J\Big[ \prod_{l=1}^{i} \theta_{J_l} \,\Big|\, i \Big],

where w_i = c̃_i Q_φ(i) and the expectation is taken with respect to the random variable J with pmf P_φ(J = j | i) as defined in (11). The runtime has not yet improved compared to (10). However, notice that the expectation E_J[Π_{l=1}^{i} θ_{J_l} | i] may be approximated via

\hat{\mathbb{E}}_J\Big[ \prod_{l=1}^{i} \theta_{J_l} \,\Big|\, i \Big] = \frac{1}{m} \sum_{r=1}^{m} \prod_{l=1}^{i} \theta_{j^{(r)}_l}

by drawing m samples j^{(1)}, . . . , j^{(m)} from P_φ(J | i). Most importantly, Q_φ(i) will not change as long as the structure G does not change. This procedure is summarized in Algorithm 2.

Algorithm 2 SDCCQ
Input: θ ∈ R^d, k ∈ N, m ∈ N^k, φ
Output: Approximate partition function Z^m_k(θ)
1: [l, u] ← interval(θ)
2: c̃ ← coefficients(k, [l, u])
3: for i = 1 to k do
4:   if Q_φ(i) not cached then
5:     Q_φ(i) ← Σ_{j∈[d]^i} χ_φ(j)
6:     cache Q_φ(i)
7: Z^m_k(θ) ← 0
8: for i = 1 to k do
9:   sum ← 0
10:  for r = 1 to m_i do
11:    sample j ∼ P_φ(J | i)
12:    sum ← sum + Π_{l=1}^{i} θ_{j_l}
13:  Z^m_k(θ) ← Z^m_k(θ) + c̃_i Q_φ(i) (1/m_i) sum

It is clear from line 4 that the Q_φ(i) are only computed once. If they are cached, the runtime is O(k² max_i m_i).
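The decoupling is easy to express in code. The following minimal sketch (not the authors' C++ implementation) covers only the cheap, parameter-dependent part of Algorithm 2, lines 7-13; `sample_index_vector` is a hypothetical stand-in for a sampler that draws j ∼ P_φ(· | i), which in the real pipeline would be driven by the cached χ_φ values computed once per structure on a larger machine.

```python
import random

# Cheap part of Algorithm 2: combine cached constants Q[i] = Q_phi(i),
# native coefficients c_tilde and Monte Carlo estimates of
# E_J[prod_l theta_{J_l} | i]. The degree-0 term is omitted here, as in
# the pseudocode above.

def sdccq_estimate(c_tilde, Q, theta, m, sample_index_vector, rng=random):
    """Stochastic estimate Z^m_k(theta), given cached Q and a sampler."""
    k = len(c_tilde) - 1
    z_hat = 0.0
    for i in range(1, k + 1):
        acc = 0.0
        for _ in range(m[i - 1]):
            j = sample_index_vector(i, rng)   # j in [d]^i, drawn from P_phi(.|i)
            prod = 1.0
            for jl in j:
                prod *= theta[jl]
            acc += prod
        z_hat += c_tilde[i] * Q[i] * acc / m[i - 1]
    return z_hat
```

Only θ enters this routine; the cached Q_φ(i) and the sampler stay fixed as long as the structure G and the state space X do not change, which is exactly what allows the expensive part to be shipped to a server.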
In Section 4 it is shown which m_i suffice to guarantee a small approximation error. However, we can already anticipate that m_i ≪ d^k. We would like to stress again that the Q_φ(i) will not change as long as the graphical structure and the state space X are fixed; moreover, the Q_φ(i) need not be computed on the same machine. It is reasonable to assume that the Q_φ(i) are computed on a large cluster while the inference is eventually carried out on a small device with low computational resources or energy constraints.

Figure 1. From left to right: (i) Absolute error of degree-16 Taylor, Remez and DCT approximations of exp(x) on the interval [−5, 5]; only Remez and DCT have oscillating error curves. (ii) Theoretical bounds for the absolute error |Z(θ) − Ẑ_k(θ)| of DCCQ (Alg. 1) as a function of the parameter norm for a 4×4 Ising grid model, predicted by Lemma 3. (iii) Absolute error |Z(θ) − Z^m_k(θ)| of SDCCQ (Alg. 2) as a function of the parameter norm for a 4×4 Ising grid model; m_i = 10^4 for all i. (iv) Scalability of SDCCQ (Alg. 2): runtime in seconds as a function of the number of CPU cores for different polynomial degrees. Best viewed in color.

4. Analysis

The most important prerequisite for our algorithms is χ-integrability of the sufficient statistic.

Lemma 2. The sufficient statistic of a discrete state space model, constructed from binary indicator functions (2), (3) and (4), is χ-integrable.

Proof. The state space is discrete and thus the definition of χ-integrability becomes

\chi_\phi(j) = \sum_{x \in X} \prod_{i=1}^{k} \phi(x)_{j_i}, \quad k \in \mathbb{N},\ j \in [d]^k.

Since φ(x)_{j_i} is binary, the product Π_{i=1}^{k} φ(x)_{j_i} is binary too. The summation therefore equals the number of instances x ∈ X for which the product evaluates to 1, for an arbitrary but fixed j. We will derive this number now. As described in Section 2, each index of the vector φ(x) corresponds to a particular assignment of a value y to a clique C. Since j may be any k-dimensional vector with elements from {1, 2, . . . , d}, it might happen that it contains incompatible indices. Two incompatible indices correspond to two different assignments to the same vertex, which is not possible. Since the summation is over all possible instances, the product can never evaluate to 1 for vectors j which contain incompatible indices, and hence χ_φ(j) = 0. Checking if a vector contains a pair of incompatible indices can be done with O(k²) comparisons. Any j that contains only compatible indices corresponds to a possible assignment y_{U(j)} to an induced subset U(j) ⊆ V of vertices. The product will evaluate to 1 for every instance x that matches this particular assignment, i.e., x_{U(j)} = y_{U(j)}. Fixing the values of the vertices in U(j) results in a remaining state space of size |X|/|X_{U(j)}|, which is exactly the number of times that the product will evaluate to 1. This finally implies that χ_φ(j) = |X|/|X_{U(j)}| if j contains no incompatible indices, and χ_φ(j) = 0 otherwise. This can of course be computed in polynomial time and φ is thus χ-integrable.

In the rest of this section, we present our main theorems which provide bounds on the error of Algorithms 1 and 2. Formally, the following error bound is known for Chebyshev approximations (Xiang et al., 2010).

Lemma 3. Suppose f is analytic in the region bounded by the ellipse E_ρ = {z ∈ C : |z + √(z² − 1)| = ρ} with major and minor semi-axis lengths summing to ρ > 1, foci at ±1, and max_{z∈E_ρ} |f(z)| ≤ M. Let g_k denote the degree-k interpolant of f according to Eq. (9). Then, for each k ≥ 0,

\max_{z \in [-1,1]} |f(z) - g_k(z)| \le \frac{4M}{(\rho - 1)\rho^k}.
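A quick numerical check of Lemma 3 for f = exp (the case needed below) helps to gauge how loose the bound is in practice. The sketch below is ours and uses Chebyshev zeros as interpolation nodes, a standard variant that differs slightly from the node choice in (8); M is the maximum of |exp(z)| on E_ρ, attained at the real semi-axis (ρ + 1/ρ)/2.

```python
import math

# Compare the observed Chebyshev approximation error of exp on [-1, 1]
# with the Lemma 3 bound 4M / ((rho - 1) rho^k) for rho = 1 + sqrt(2).

def chebyshev_interpolant(f, k):
    nodes = [math.cos(math.pi * (j + 0.5) / (k + 1)) for j in range(k + 1)]
    c = [(1.0 if i == 0 else 2.0) / (k + 1)
         * sum(f(x) * math.cos(i * math.acos(x)) for x in nodes)
         for i in range(k + 1)]
    return lambda x: sum(ci * math.cos(i * math.acos(x)) for i, ci in enumerate(c))

rho = 1.0 + math.sqrt(2.0)
M = math.exp((rho + 1.0 / rho) / 2.0)          # max of |exp| on E_rho
for k in (2, 4, 8):
    g = chebyshev_interpolant(math.exp, k)
    grid = [x / 1000.0 for x in range(-1000, 1001)]
    observed = max(abs(math.exp(x) - g(x)) for x in grid)
    bound = 4.0 * M / ((rho - 1.0) * rho ** k)
    print(k, observed, bound)
    assert observed <= bound
```

For exp, the observed error decays considerably faster than the bound, which is consistent with the empirical behavior reported in Section 5.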
To apply this Lemma in the sequel, we will consider the function f = exp ∘ γ with γ : [−1, 1] → [l, u] and γ(x) = (l + u)/2 + x(u − l)/2 (Bernstein, 1912; Mason & Handscomb, 2002). Any choice of ρ > 1 would suffice, but increasing ρ increases the imaginary axis of E_ρ and, at the same time, both numerator and denominator of the error bound. On the other hand, for ρ = 1, E_ρ collapses to the real line, which is not compatible with the Lemma. We set ρ = 1 + √2 as suggested in (Xiang et al., 2010).

Lemma 4. Mean field lower bound. Any mean parameter µ ∈ M with µ = Σ_x P(x) φ(x) yields a lower bound on the partition function: A(θ) ≥ ⟨θ, µ⟩ − A*(µ), where A* is the convex conjugate of A = ln Z (Wainwright & Jordan, 2008). It can be shown that A*(µ) is equal to the negative Shannon entropy, i.e., A*(µ) = −H(P) = Σ_x P(x) ln P(x).

Theorem 5. Deterministic approximation error. Let φ be χ-integrable, ε > 0, k ≥ (ln 8 + 2 ln M − ln(ε(ρ − 1)))(ln ρ)^{−1} and ‖φ(x)‖₂ ≤ B for all x. Then, the output Ẑ_k(θ) of Algorithm 1 satisfies

|Z(\theta) - \hat{Z}_k(\theta)| \le \frac{\varepsilon}{2}\, Z(\theta).

Proof. The idea is to show that the lower bound from Lemma 4 is also an upper bound for the right-hand side in Lemma 3. To see this, construct the naive mean field variational lower bound, based on a fully factored joint probability P(X = x) = Π_{v∈V} P_v(X_v) with uniform vertex marginals P_v(X_v) = U(X_v) (here we assume that the uniform distribution on X_v exists) for all v ∈ V. This implies for edge marginals that P_{vu}(X_v, X_u) = P_v(X_v) P_u(X_u). Plugging this into Lemma 4, we get

A(\theta) \ge \langle \theta, \mu \rangle + H(\mu) \ge -\|\theta\|_2 \|\mu\|_2 + H(\mu) \ge -\|\theta\|_2 B - \sum_{v \in V} \sum_{x_v \in X_v} \frac{1}{|X_v|} \ln \frac{1}{|X_v|} = -\|\theta\|_2 B + \sum_{v \in V} \ln |X_v|,

since ‖φ(x)‖₂ ≤ B implies ‖µ‖₂ ≤ B. Now, let

M = \max_{z \in E_\rho} |\exp(\gamma(z))| = \exp(\gamma(\sqrt{2})) = \exp\big((l+u)/2 + \sqrt{2}(u-l)/2\big) > \exp\big((l+u)/2 + (u-l)/2\big) = \exp(\|\theta\|_2 B) \ge \max_x \psi(x)

be an upper bound on the potential, where we set u = ‖θ‖₂ B as suggested in Section 3. By applying Lemma 3, it follows that

|Z(\theta) - \hat{Z}_k(\theta)| = \Big| \sum_{x} \big(\psi(x) - \hat{\psi}_k(x)\big) \Big| \le \sum_x |\psi(x) - \hat{\psi}_k(x)| \le |X|\, \frac{4M}{(\rho - 1)\rho^k}.

Combining this with the lower bound Z(θ) ≥ |X| exp(−‖θ‖₂ B) from above and the stated choice of k yields the claim.

Notice that the occurrence of ‖θ‖₂ is an artifact of the Cauchy-Schwarz inequality. One may conduct the same derivation based on Hölder's inequality, which in turn would replace the term ‖θ‖₂ B by ‖θ‖₁ B_∞.

Now, we analyze the error that we have to tolerate if we want to apply complexity-decoupling as explained in Section 3.1. Therein, additional error is introduced by approximating the expected value E_J[Π_{l=1}^{i} θ_{J_l} | i] with the sample mean Ê_m[Π_{l=1}^{i} θ_{J_l} | i] := (1/m) Σ_{r=1}^{m} Π_{l=1}^{i} θ_{j^{(r)}_l}, where each j^{(r)} is sampled from the distribution with pmf (11). Indeed, Ê_m is unbiased and converges to E_J. However, we would like to quantify the number of samples that is required to achieve a desired accuracy.

Theorem 6. Let X = X(J) = Π_{l=1}^{i} θ_{J_l}, where J is the random variable with pmf (11). Moreover, let ε_i > 0, ζ_0 ∈ (0, 1) and m ≥ −(ln ζ_0)/(2 ‖θ‖_∞^{−i} ε_i |ln |X| − ln M|), where M is an upper bound on the potential. Then

P\big[\, |\mathbb{E}_J[X \mid i] - \hat{\mathbb{E}}_m[X \mid i]| \ge \varepsilon_i |Z(\theta)| \,\big] \le \zeta_0.

Proof. Applying a Chernoff-like argument, followed by the triangle inequality and Lemma 4, we obtain

P\big[\, |\hat{\mathbb{E}}_m[X \mid i] - \mathbb{E}[X \mid i]| \ge \varepsilon_i |Z(\theta)| \,\big] \le \exp\big(-m \cdot 2\|\theta\|_\infty^{-i} \varepsilon_i |\ln|X| - \ln M|\big) \le \zeta_0,

where the last inequality follows from the stated lower bound on m.
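Before turning to the combined guarantee, a quick worked instance of the degree requirement of Theorem 5 may be helpful; the numbers below are only an illustration of the bound as stated above, for an arbitrarily chosen parameter norm. With the symmetric interval u = −l = ‖θ‖₂B = 2 (so that ln M = √2 ‖θ‖₂B ≈ 2.83), ε = 0.1 and ρ = 1 + √2, the bound asks for

k \ge \frac{\ln 8 + 2\ln M - \ln(\varepsilon(\rho - 1))}{\ln \rho} \approx \frac{2.08 + 5.66 + 1.96}{0.88} \approx 11,

i.e., a degree-11 interpolation already suffices for a relative error of ε/2 = 5% at this parameter norm; the required degree grows only linearly in ‖θ‖₂B and logarithmically in 1/ε.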
Theorem 7. Stochastic approximation error. Let the preconditions of Theorem 5 hold. Furthermore, let ζ ∈ (0, 1). Algorithm 2 with

m_i \ge -\frac{\ln(\zeta/k)}{2\,\|\theta\|_\infty^{-i}\,\frac{\varepsilon}{2k|w_i|}\,|\ln|X| - \ln M|}

delivers an (ε, ζ)-approximation of the partition function. In particular, P[ |Z(θ) − Z^m_k(θ)| ≥ ε Z(θ) ] ≤ ζ.

Proof. Let X = X(J) = Π_{l=1}^{i} θ_{J_l} and w_i = c̃_i Q_φ(i). We conclude that

P\big[\, |Z(\theta) - Z^m_k(\theta)| \ge \varepsilon Z(\theta) \,\big] \le P\Big[ \sum_{i=1}^{k} |w_i|\, \big|\mathbb{E}_J[X \mid i] - \hat{\mathbb{E}}_{m_i}[X \mid i]\big| \ge \frac{\varepsilon}{2} Z(\theta) \Big] \le \sum_{i=1}^{k} P\Big[ \big|\mathbb{E}_J[X \mid i] - \hat{\mathbb{E}}_{m_i}[X \mid i]\big| \ge \frac{\varepsilon}{2k|w_i|} Z(\theta) \Big]

from the triangle inequality and Theorem 5. Now, the claim follows from Theorem 6 with ε_i = ε/(2k|w_i|) and ζ_0 = ζ/k.

Notice that c̃ is required to compute m. But this is not restrictive, since c̃ can be computed easily with (8), followed by a conversion to native coefficients. Hence, it is safe to assume that the c̃ are available for computing m.

5. Numerical Evaluation

For our experiments, we implemented DCCQ (Alg. 1) and SDCCQ (Alg. 2) and executed them on a machine with 40 E5-2697 Xeon CPU cores. The numerical evaluation should show that SDCCQ achieves highly accurate results whenever the norm of the parameter vector is small. In addition, we investigate cases where the norm is not small and compare our method to the approaches which are listed in Table 1. Finally, we analyze the scalability of SDCCQ in terms of parallel computing.

Figure 2. Estimation errors for the log-partition function on 10×10 Ising grids with randomly generated parameter vectors for various coupling strengths ω. (i) attractive, κ = 0.1; (ii) attractive, κ = 1.0; (iii) mixed, κ = 0.1; (iv) mixed, κ = 1.0. Best viewed in color.

One of the main benefits of SDCCQ is the complexity-decoupling. This allows us to compute the Q_φ(i) (see Alg. 2) only once per structure and reuse them in every run. (Our C++ source code and the precomputed Q_φ(i) values are available at http://sfb876.tu-dortmund.de/sdccq.) In total, we computed the Q_φ(i) for Ising grid models with sizes 2×2, . . . , 128×128 and i ∈ {1, 2, . . . , 16}.

Parameters with small norm. First of all, we show that SDCCQ achieves small errors when the norm of θ is small (at most 1). Therefore, we constructed a 4×4 Ising grid with random Gaussian parameters, i.e., θ_i ∼ N(0, σ). We varied σ from 10^{−4} to 1 in order to generate models with various norms. The partition function was then estimated by SDCCQ for polynomial degrees in {2, 4, 6, 8, 10, 12} with m_i = 10^3 for all i, while the correct value of the partition function was computed by the JT algorithm. In Figure 1 (iii), the average absolute estimation error |Z(θ) − Z^m_k(θ)| (y-axis) is plotted against the norm of θ (x-axis). For comparison, the theoretical error curves which are implied by Lemma 3 are depicted in Figure 1 (ii). Empirically, the error of the stochastic low-degree approximations is lower than predicted, while the error of the high-degree approximations is higher than predicted. Note, however, that the error bound of SDCCQ (Theorem 7) is stochastic, i.e., an ε-approximation is achieved only with a certain probability. Further experiments show that the error can indeed be decreased by increasing the m_i. In this setting, an absolute error of 12000 corresponds to a relative error of ε < 0.2.
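The theoretical curves of Figure 1 (ii) can be traced with a few lines; the following sketch evaluates the worst-case bound |X| · 4M/((ρ − 1)ρ^k) implied by Lemma 3 for a 4×4 Ising grid. The clique count used below (16 vertices plus 24 edges) and the symmetric interval l = −u are assumptions made only for this illustration.

```python
import math

# Lemma 3 based worst-case bound on |Z(theta) - Z_k(theta)| for a 4x4
# Ising grid, as a function of the parameter norm and the degree k.

rho = 1.0 + math.sqrt(2.0)
n_states = 2 ** 16                       # |X| for 16 binary vertices
B = math.sqrt(16 + 24)                   # ||phi(x)||_2 = sqrt(|C(G)|)

def error_bound(theta_norm, k):
    u = theta_norm * B                   # u = ||theta||_2 * B, l = -u
    M = math.exp(math.sqrt(2.0) * u)     # max of exp(gamma(z)) on E_rho
    return n_states * 4.0 * M / ((rho - 1.0) * rho ** k)

for k in (2, 4, 8):
    print(k, [round(error_bound(t, k), 3) for t in (0.01, 0.1, 0.5, 1.0)])
```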
Parameters with large norm. In the second experiment, we investigate the quality of our new method when the norm of the parameter vector is well above 1 and hence Theorem 7 predicts a high error for small polynomial degrees. Nevertheless, the question is how SDCCQ compares to other approaches. We use the same experimental setup as in (Ermon et al., 2013). Specifically, we have n = 10 × 10 binary variables x ∈ {−1, 1}^n with weights θ_{vu=xy} = w_{vu} whenever x = y and θ_{vu=xy} = −w_{vu} otherwise. In the attractive setting, the w_{vu} are drawn from [0, ω]; in the mixed setting, from [−ω, ω]. Moreover, vertex weights θ_{v=1} = −θ_{v=−1} = w_v are sampled from [−κ, κ] with κ ∈ {0.1, 1.0}. This setting implies that ‖θ‖₂ is in the range [3.86, 44.53], which is far outside of the interval in which we should expect small errors. Figure 2 shows the estimation error for the log-partition function, i.e., ln Z^m_k(θ) − ln Z(θ), averaged over 5 folds, for SDCCQ, WISH, MF, TRW and LBP. SDCCQ has been run with m_i = 10^3 for all i and k ∈ {1, 2, 4, 8}, where, due to space restrictions, the plot shows the average over all SDCCQ runs. Individual runs are within a standard deviation of 5. Clearly, the error increases for increasing coupling strengths. However, SDCCQ is often close to the MF lower bound or the TRW upper bound. WISH delivers the most accurate result but has by far the largest complexity per run.

Scalability. Finally, we analyze the empirical runtime on randomly parametrized Ising grids with 128×128 vertices. Since the major computational costs in Algorithms 1 and 2 arise through a large summation and the sampling step, respectively, both are easy to parallelize. We restricted the execution to a subset of the available CPU cores and measured the runtime in seconds. The result is shown in Figure 1 (iv): SDCCQ scales well with an increasing number of CPU cores.

Conclusion. For the first time, we exploited quadrature rules to construct an approximation to the partition function of probabilistic models with arbitrary structure. We derived bounds on the error and compared the new algorithm to other methods. It turned out that the new method scales well on multi-core systems and that it delivers superior results whenever the norm of the model's parameter vector is small. Moreover, complexity-decoupling allows us to run probabilistic inference with error guarantees on small, resource-constrained devices.

Acknowledgments

This work has been supported by the Deutsche Forschungsgemeinschaft (DFG) within the collaborative research center SFB 876, project A1.

References

Bernstein, Sergei Natanovich. Sur la meilleure approximation des fonctions continues par les polynômes du degré donné. I. Communications de la Société mathématique de Kharkow, 2e série, 13(2-3):49–144, 1912.

Bethe, Hans A. Statistical theory of superlattices. Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences, 150(871):552–575, 1935.

Blei, David M., Ng, Andrew Y., and Jordan, Michael I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

Bulatov, Andrei and Grohe, Martin. The complexity of partition functions. In Díaz, Josep, Karhumäki, Juhani, Lepistö, Arto, and Sannella, Donald (eds.), Automata, Languages and Programming, volume 3142 of Lecture Notes in Computer Science, pp. 294–306. Springer, Heidelberg, Germany, 2004.

Clenshaw, C. W. and Curtis, A. R. A method for numerical integration on an automatic computer. Numerische Mathematik, 2(1):197–205, 1960.

Clifford, Peter. Markov random fields in statistics. In Disorder in Physical Systems, Oxford Science Publications, pp. 19–32. Oxford University Press, New York, 1990.

Cover, Thomas M. and Thomas, Joy A. Elements of Information Theory. John Wiley & Sons, New York, NY, USA, 2nd edition, 2006. ISBN 978-0471241959.

Ermon, Stefano, Gomes, Carla P., Sabharwal, Ashish, and Selman, Bart. Taming the curse of dimensionality: Discrete integration by hashing and optimization. In 30th International Conference on Machine Learning, pp. 334–342, 2013.
Fraser, W. A survey of methods of computing minimax and near-minimax polynomial approximations for functions of a single independent variable. Journal of the ACM, 12(3):295–314, July 1965.

Frey, Brendan J. Local probability propagation for factor analysis. In Solla, S. A., Leen, T. K., and Müller, K. (eds.), Advances in Neural Information Processing Systems, volume 12, pp. 442–448. MIT Press, Cambridge, MA, USA, 2000.

Gautschi, W. Questions of numerical condition related to polynomials. Studies in Numerical Analysis, (24):140–177, 1985.

Girolami, Mark and Calderhead, Ben. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.

Goldberg, Leslie Ann and Jerrum, Mark. A polynomial-time algorithm for estimating the partition function of the ferromagnetic Ising model on a regular matroid. SIAM Journal on Computing, 42(3):1132–1157, 2013.

Grünwald, Peter D. The Minimum Description Length Principle. The MIT Press, Cambridge, MA, USA, 2007. ISBN 978-0262072816.

Han, Insu, Malioutov, Dmitry, and Shin, Jinwoo. Large-scale log-determinant computation through stochastic Chebyshev expansions. In 32nd International Conference on Machine Learning, pp. 908–917, 2015.

Heskes, Tom. Stable fixed points of loopy belief propagation are local minima of the Bethe free energy. In Becker, S., Thrun, S., and Obermayer, K. (eds.), Advances in Neural Information Processing Systems, volume 15, pp. 343–350, 2002.

Jaakkola, Tommi S. and Jordan, Michael I. Computing upper and lower bounds on likelihoods in intractable networks. In 12th Annual Conference on Uncertainty in Artificial Intelligence, pp. 340–348. Morgan Kaufmann Publishers, 1996.

Koller, Daphne and Friedman, Nir. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009. ISBN 978-0262013192.

Kschischang, Frank R., Frey, Brendan J., and Loeliger, Hans-Andrea. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, 2001.

Lauritzen, Steffen L. Graphical Models. Oxford University Press, Oxford, UK, 1996. ISBN 978-0198522195.

Lauritzen, Steffen L. and Spiegelhalter, David J. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society. Series B (Methodological), 50(2):157–224, 1988.

Liu, Qiang, Peng, Jian, Ihler, Alexander, and Fisher III, John. Estimating the partition function by discriminance sampling. In 31st Annual Conference on Uncertainty in Artificial Intelligence, pp. 514–522. AUAI Press, 2015.

Mason, J. C. and Handscomb, David C. Chebyshev Polynomials. Chapman and Hall/CRC, 1st edition, 2002. ISBN 978-0849303555.

Pearl, Judea. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, Burlington, MA, USA, 1988. ISBN 978-1558604797.

Pitman, Edwin James George. Sufficient statistics and intrinsic accuracy. Mathematical Proceedings of the Cambridge Philosophical Society, 32:567–579, 1936.

Ranganath, Rajesh, Tang, Linpeng, Charlin, Laurent, and Blei, David M. Deep exponential families. In Lebanon, Guy and Vishwanathan, S. V. N. (eds.), 18th International Conference on Artificial Intelligence and Statistics, volume 38 of JMLR W&CP. JMLR.org, 2015.

Ruozzi, Nicholas. The Bethe partition function of log-supermodular graphical models.
In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, Kilian Q. (eds.), Advances in Neural Information Processing Systems, volume 25, pp. 117–125. Curran Associates, Inc., 2012.

Schraudolph, Nicol N. and Kamenetsky, Dmitry. Efficient exact inference in planar Ising models. In Koller, Daphne, Schuurmans, Dale, Bengio, Yoshua, and Bottou, Léon (eds.), Advances in Neural Information Processing Systems, volume 21, pp. 1417–1424, 2008.

Sutton, Charles and McCallum, Andrew. An introduction to conditional random fields. Foundations and Trends in Machine Learning, 4(4):267–373, 2011.

Valiant, Leslie G. The complexity of enumeration and reliability problems. SIAM Journal on Computing, 8(3):410–421, 1979.

Wainwright, Martin J. and Jordan, Michael I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.

Wainwright, Martin J., Jaakkola, Tommi S., and Willsky, Alan S. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7):2313–2335, 2005.

Weiss, Yair. Comparing the mean field method and belief propagation for approximate inference in MRFs. In Opper, M. and Saad, D. (eds.), Advanced Mean Field Methods: Theory and Practice, pp. 229–239. MIT Press, Cambridge, MA, USA, 2001.

Weller, Adrian and Jebara, Tony. Clamping variables and approximate inference. In Ghahramani, Zoubin, Welling, Max, Cortes, Corinna, Lawrence, Neil D., and Weinberger, Kilian Q. (eds.), Advances in Neural Information Processing Systems, volume 27, pp. 909–917, 2014.

Xiang, Shuhuang, Chen, Xiaojun, and Wang, Haiyong. Error bounds for approximation in Chebyshev points. Numerische Mathematik, 116(3):463–491, 2010.

Yedidia, Jonathan S., Freeman, William T., and Weiss, Yair. Understanding belief propagation and its generalizations. In Exploring Artificial Intelligence in the New Millennium, pp. 239–269. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003.