# clamping_variables_and_approximate_inference__52eae341.pdf Clamping Variables and Approximate Inference Adrian Weller Columbia University, New York, NY 10027 adrian@cs.columbia.edu Tony Jebara Columbia University, New York, NY 10027 jebara@cs.columbia.edu It was recently proved using graph covers (Ruozzi, 2012) that the Bethe partition function is upper bounded by the true partition function for a binary pairwise model that is attractive. Here we provide a new, arguably simpler proof from first principles. We make use of the idea of clamping a variable to a particular value. For an attractive model, we show that summing over the Bethe partition functions for each sub-model obtained after clamping any variable can only raise (and hence improve) the approximation. In fact, we derive a stronger result that may have other useful implications. Repeatedly clamping until we obtain a model with no cycles, where the Bethe approximation is exact, yields the result. We also provide a related lower bound on a broad class of approximate partition functions of general pairwise multi-label models that depends only on the topology. We demonstrate that clamping a few wisely chosen variables can be of practical value by dramatically reducing approximation error. 1 Introduction Marginal inference and estimating the partition function for undirected graphical models, also called Markov random fields (MRFs), are fundamental problems in machine learning. Exact solutions may be obtained via variable elimination or the junction tree method, but unless the treewidth is bounded, this can take exponential time (Pearl, 1988; Lauritzen and Spiegelhalter, 1988; Wainwright and Jordan, 2008). Hence, many approximate methods have been developed. Of particular note is the Bethe approximation, which is widely used via the loopy belief propagation algorithm (LBP). Though this is typically fast and results are often accurate, in general it may converge only to a local optimum of the Bethe free energy, or may not converge at all (Mc Eliece et al., 1998; Murphy et al., 1999). Another drawback is that, until recently, there were no guarantees on whether the returned approximation to the partition function was higher or lower than the true value. Both aspects are in contrast to methods such as the tree-reweighted approximation (TRW, Wainwright et al., 2005), which features a convex free energy and is guaranteed to return an upper bound on the true partition function. Nevertheless, empirically, LBP or convergent implementations of the Bethe approximation often outperform other methods (Meshi et al., 2009; Weller et al., 2014). Using the method of graph covers (Vontobel, 2013), Ruozzi (2012) recently proved that the optimum Bethe partition function provides a lower bound for the true value, i.e. ZB Z, for discrete binary MRFs with submodular log potential cost functions of any arity. Here we provide an alternative proof for attractive binary pairwise models. Our proof does not rely on any methods of loop series (Sudderth et al., 2007) or graph covers, but rather builds on fundamental properties of the derivatives of the Bethe free energy. Our approach applies only to binary models (whereas Ruozzi, 2012 applies to any arity), but we obtain stronger results for this class, from which ZB Z easily follows. We use the idea of clamping a variable and considering the approximate sub-partition functions over the remaining variables, as the clamped variable takes each of its possible values. Notation and preliminaries are presented in 2. In 3, we derive a lower bound, not just for the standard Bethe partition function, but for a range of approximate partition functions over multi-label variables that may be defined from a variational perspective as an optimization problem, based only on the topology of the model. In 4, we consider the Bethe approximation for attractive binary pairwise models. We show that clamping any variable and summing the Bethe sub-partition functions over the remaining variables can only increase (hence improve) the approximation. Together with a similar argument to that used in 3, this proves that ZB Z for this class of model. To derive the result, we analyze how the optimum of the Bethe free energy varies as the singleton marginal of one particular variable is fixed to different values in [0, 1]. Remarkably, we show that the negative of this optimum, less the singleton entropy of the variable, is a convex function of the singleton marginal. This may have further interesting implications. We present experiments in 5, demonstrating that clamping even a single variable selected using a simple heuristic can be very beneficial. 1.1 Related work Branching or conditioning on a variable (or set of variables) and approximating over the remaining variables has a fruitful history in algorithms such as branch-and-cut (Padberg and Rinaldi, 1991; Mitchell, 2002), work on resolution versus search (Rish and Dechter, 2000) and various approaches of (Darwiche, 2009, Chapter 8). Cutset conditioning was discussed by Pearl (1988) and refined by Peot and Shachter (1991) as a method to render the remaining topology acyclic in preparation for belief propagation. Eaton and Ghahramani (2009) developed this further, introducing the conditioned belief propagation algorithm together with back-belief-propagation as a way to help identify which variables to clamp. Liu et al. (2012) discussed feedback message passing for inference in Gaussian (not discrete) models, deriving strong results for the particular class of attractive models. Choi and Darwiche (2008) examined methods to approximate the partition function by deleting edges. 2 Preliminaries We consider a pairwise model with n variables X1, . . . , Xn and graph topology (V, E): V contains nodes {1, . . ., n} where i corresponds to Xi, and E V V contains an edge for each pairwise relationship. We sometimes consider multi-label models where each variable Xi takes values in {0, . . . , Li 1}, and sometimes restrict attention to binary models where Xi B = {0, 1} i. Let x = (x1, . . . , xn) be a configuration of all the variables, and N(i) be the neighbors of i. For all analysis of binary models, to be consistent with Welling and Teh (2001) and Weller and Jebara (2013), we assume a reparameterization such that p(x) = e E(x) Z , where the energy of a configuration, E = P i V θixi P (i,j) E Wijxixj, with singleton potentials θi and edge weights Wij. 2.1 Clamping a variable and related definitions We shall find it useful to examine sub-partition functions obtained by clamping one particular variable Xi, that is we consider the model on the n 1 variables X1, . . . , Xi 1, Xi+1, . . . , Xn obtained by setting Xi equal to one of its possible values. Let Z|Xi=a be the sub-partition function on the model obtained by setting Xi = a, a {0, . . ., Li 1}. Observe that true partition functions and marginals are self-consistent in the following sense: j=0 Z|Xi=j i V, p(Xi = a) = Z|Xi=a PLi 1 j=0 Z|Xi=j . (1) This is not true in general for approximate forms of inference,1 but if the model has no cycles, then in many cases of interest, (1) does hold, motivating the following definition. Definition 1. We say an approximation to the log-partition function ZA is Exact On Trees if it may be specified by the variational formula log ZA = minq Q FA(q) where: (1) Q is some compact space that includes the marginal polytope; (2) FA is a function of the (pseudo-)distribution q (typically a free energy approximation); and (3) For any model, whenever a subset of variables V V is clamped to particular values P = {pi {0, . . . , Li 1}, Xi V }, i.e. Xi V , we constrain 1For example, consider a single cycle with positive edge weights. This has ZB < Z (Weller et al., 2014), yet after clamping any variable, each resulting sub-model is a tree hence the Bethe approximation is exact. Xi = pi, which we write as V P, and the remaining induced graph on V \ V is acyclic, then the approximation is exact, i.e. ZA|V P = Z|V P . Similarly, define an approximation to be in the broader class of Not Smaller On Trees if it satisfies all of the above properties except that condition (3) is relaxed to ZA|V P Z|V P . Note that the Bethe approximation is Exact On Trees, and approximations such as TRW are Not Smaller On Trees, in both cases whether using the marginal polytope or any relaxation thereof, such as the cycle or local polytope (Weller et al., 2014). We shall derive bounds on ZA with the following idea: Obtain upper or lower bounds on the approximation achieved by clamping and summing over the approximate sub-partition functions; Repeat until an acyclic graph is reached, where the approximation is either exact or bounded. We introduce the following related concept from graph theory. Definition 2. A feedback vertex set (FVS) of a graph is a set of vertices whose removal leaves a graph without cycles. Determining if there exists a feedback vertex set of a given size is a classical NP-hard problem (Karp, 1972). There is a significant literature on determining the minimum cardinality of an FVS of a graph G, which we write as ν(G). Further, if vertices are assigned nonnegative weights, then a natural problem is to find an FVS with minimum weight, which we write as νw(G). An FVS with a factor 2 approximation to νw(G) may be found in time O(|V| + |E| log |E|) (Bafna et al., 1999). For pairwise multi-label MRFs, we may create a weighted graph from the topology by assigning each node i a weight of log Li, and then compute the corresponding νw(G). 3 Lower Bound on Approximate Partition Functions We obtain a lower bound on any approximation that is Not Smaller On Trees by observing that ZA ZA|Xn=j j from the definition (the sub-partition functions optimize over a subset). Theorem 3. If a pairwise MRF has topology with an FVS of size n and corresponding values L1, . . . , Ln, then for any approximation that is Not Smaller On Trees, ZA Z Qn i=1 Li . Proof. We proceed by induction on n. The base case n = 0 holds by the assumption that ZA is Not Smaller On Trees. Now assume the result holds for n 1 and consider a MRF which requires n vertices to be deleted to become acyclic. Clamp variable Xn at each of its Ln values to create the approximation Z(n) A := PLn 1 j=0 ZA|Xn=j. By the definition of Not Smaller On Trees, ZA ZA|Xn=j j; and by the inductive hypothesis, ZA|Xn=j Z|Xn=j Qn 1 i=1 Li . Hence, Ln ZA Z(n) A = PLn 1 j=0 ZA|Xn=j 1 Qn 1 i=1 Li PLn 1 j=0 Z|Xn=j = Z Qn 1 i=1 Li . By considering an FVS with minimum Qn i=1 Li, Theorem 3 is equivalent to the following result. Theorem 4. For any approximation that is Not Smaller On Trees, ZA Ze νw. This bound applies to general multi-label models with any pairwise and singleton potentials (no need for attractive). The bound is trivial for a tree, but already for a binary model with one cycle we obtain that ZB Z/2 for any potentials, even over the marginal polytope. The bound is tight, at least for uniform Li = L i.2 The bound depends only on the vertices that must be deleted to yield a graph with no cycles, not on the number of cycles (which clearly upper bounds ν(G)). For binary models, exact inference takes time Θ((|V| |ν(G)|)2ν(G)). 4 Attractive Binary Pairwise Models In this Section, we restrict attention to the standard Bethe approximation. We shall use results derived in (Welling and Teh, 2001) and (Weller and Jebara, 2013), and adopt similar notation. The Bethe partition function, ZB, is defined as in Definition 1, where Q is set as the local polytope relaxation and FA is the Bethe free energy, given by F(q) = Eq(E) SB(q), where E is the energy 2For example, in the binary case: consider a sub-MRF on a cycle with no singleton potentials and uniform, very high edge weights. This can be shown to have ZB Z/2 (Weller et al., 2014). Now connect ν of these together in a chain using very weak edges (this construction is due to N. Ruozzi). and SB is the Bethe pairwise entropy approximation (see Wainwright and Jordan, 2008 for details). We consider attractive binary pairwise models and apply similar clamping ideas to those used in 3. In 4.1 we show that clamping can never decrease the approximate Bethe partition function, then use this result in 4.2 to prove that ZB Z for this class of model. In deriving the clamping result of 4.1, in Theorem 7 we show an interesting, stronger result on how the optimum Bethe free energy changes as the singleton marginal qi is varied over [0, 1]. 4.1 Clamping a variable can only increase the Bethe partition function Let ZB be the Bethe partition function for the original model. Clamp variable Xi and form the new approximation Z(i) B = P1 j=0 ZB|Xi=j. In this Section, we shall prove the following Theorem. Theorem 5. For an attractive binary pairwise model and any variable Xi, Z(i) B ZB. We first introduce notation and derive preliminary results, which build to Theorem 7, our strongest result, from which Theorem 5 easily follows. Let q = (q1, . . . , qn) be a location in n-dimensional pseudomarginal space, i.e. qi is the singleton pseudomarginal q(Xi = 1) in the local polytope. Let F(q) be the Bethe free energy computed at q using Bethe optimum pairwise pseudomarginals given by the formula for q(Xi = 1, Xj = 1) = ξij(qi, qj, Wij) in (Welling and Teh, 2001), i.e. for an attractive model, for edge (i, j), ξij is the lower root of αijξ2 ij [1 + αij(qi + qj)]ξij + (1 + αij)qiqj = 0, (2) where αij = e Wij 1, and Wij > 0 is the strength (associativity) of the log-potential edge weight. Let G(q) = F(q). Note that log ZB = maxq [0,1]n G(q). For any x [0, 1], consider the optimum constrained by holding qi = x fixed, i.e. let log ZBi(x) = maxq [0,1]n:qi=x G(q). Let r (x) = (r 1(x), . . . , r i 1(x), r i+1(x), . . . , r n(x)) with corresponding pairwise terms {ξ ij}, be an arg max for where this optimum occurs. Observe that log ZBi(0) = log ZB|Xi=0, log ZBi(1) = log ZB|Xi=1 and log ZB = log ZBi(q i ) = maxq [0,1]n G(q), where q i is a location of Xi at which the global optimum is achieved. To prove Theorem 5, we need a sufficiently good upper bound on log ZBi(q i ) compared to log ZBi(0) and log ZBi(1). First we demonstrate what such a bound could be, then prove that this holds. Let Si(x) = x log x (1 x) log(1 x) be the standard singleton entropy. Lemma 6 (Demonstrating what would be a sufficiently good upper bound on log ZB). If x [0, 1] such that log ZB x log ZBi(1) + (1 x) log ZBi(0) + Si(x), then: (i) ZBi(0) + ZBi(1) ZB emfc(x) where fc(x) = 1 + ec exc+Si(x), m = min(log ZBi(0), log ZBi(1)) and c = | log ZBi(1) log ZBi(0)|; and (ii) x [0, 1], fc(x) 0 with equality iff x = σ(c) = 1/(1 + exp( c)), the sigmoid function. Proof. (i) This follows easily from the assumption. (ii) This is easily checked by differentiating. It is also given in (Koller and Friedman, 2009, Proposition 11.8). See Figure 6 in the Supplement for example plots of the function fc(x). Lemma 6 motivates us to consider if perhaps log ZBi(x) might be upper bounded by x log ZBi(1)+(1 x) log ZBi(0)+Si(x), i.e. the linear interpolation between log ZBi(0) and log ZBi(1), plus the singleton entropy term Si(x). It is easily seen that this would be true if r (qi) were constant. In fact, we shall show that r (qi) varies in a particular way which yields the following, stronger result, which, together with Lemma 6, will prove Theorem 5. Theorem 7. Let Ai(qi) = log ZBi(qi) Si(qi). For an attractive binary pairwise model, Ai(qi) is a convex function. Proof. We outline the main points of the proof. Observe that Ai(x) = maxq [0,1]n:qi=x G(q) Si(x), where G(q) = F(q). Note that there may be multiple arg max locations r (x). As shown in (Weller and Jebara, 2013), F is at least thrice differentiable in (0, 1)n and all stationary points lie in the interior (0, 1)n. Given our conditions, the envelope theorem of (Milgrom, 1999, Theorem v=1/Qij, W=1 v=1/Qij, W=3 v=1/Qij, W=10 Figure 1: 3d plots of vij = Q 1 ij , using ξij(qi, qj, W ) from (Welling and Teh, 2001). 1) applies, showing that Ai is continuous in [0, 1] with right derivative3 A i+(x) = max r (qi=x) x [G(qi = x, r (x)) Si(x)] = max r (qi=x) x [G(qi = x, r (x))] d Si(x) (3) We shall show that this is non-decreasing, which is sufficient to show the convexity result of Theorem 7. To evaluate the right hand side of (3), we use the derivative shown by Welling and Teh (2001): F qi = θi + log Qi, where log Qi = log (1 qi)di 1 Q j N(i)(qi ξij) j N(i)(1 + ξij qi qj) (as in Weller and Jebara, 2013) 1 qi + log Y j N(i) Qij, here defining Qij = qi ξij 1 + ξij qi qj A key observation is that the log qi 1 qi term is exactly d Si(qi) dqi , and thus cancels the d Si(x) at the end of (3). Hence, A i+(qi) = maxr (qi) h P j N(i) log Qij(qi, r j , ξ ij) i . 4 It remains to show that this expression is non-decreasing with qi. We shall show something stronger, that at every arg max r (qi), and for all j N(i), log Qij is non-decreasing vij = Q 1 ij is nondecreasing. The result then follows since the max of non-decreasing functions is non-decreasing. See Figure 1 for example plots of the vij function, and observe that vij appears to decrease with qi (which is unhelpful here) while it increases with qj. Now, in an attractive model, the Bethe free energy is submodular, i.e. 2F qi qj 0 (Weller and Jebara, 2013), hence as qi increases, r j (qi) can only increase (Topkis, 1978). For our purpose, we must show that dr j dqi is sufficiently large such that dqi 0. This forms the remainder of the proof. At any particular arg max r (qi), writing v = vij[qi, r j (qi), ξ ij(qi, r j (qi))], we have From (Weller and Jebara, 2013), ξij qi = αij(qj ξij)+qj 1+αij(qi ξij+qj ξij) and similarly, ξij αij(qi ξij)+qi 1+αij(qj ξij+qi ξij), where αij = e Wij 1. The other partial derivatives are easily derived: qi = qi(qj 1)(1 qi)+(1+ξij qi qj)(qi ξij) (1 qi)2(qi ξij)2 , v ξij = qi(1 qj) (1 qi)(qi ξij)2 , and v (1 qi)(qi ξij). The only remaining term needed for (4) is dr j dqi . The following results are proved in the Appendix, subject to a technical requirement that at an arg max, the reduced Hessian H\i, i.e. the matrix of 3This result is similar to Danskin s theorem (Bertsekas, 1995). Intuitively, for multiple arg max locations, each may increase at a different rate, so here we must take the max of the derivatives over all the arg max. 4We remark that Qij is the ratio p(Xi=1,Xj=0) p(Xi=0,Xj=0) . p(Xi=1) p(Xi=0) = p(Xj=0|Xi=1) p(Xj=0|Xi=0). second partial derivatives of F after removing the ith row and column, must be non-singular in order to have an invertible locally linear function. Call this required property P. By nature, each H\i is positive semi-definite. If needed, a small perturbation argument allows us to assume that no eigenvalue is 0, then in the limit as the perturbation tends to 0, Theorem 7 holds since the limit of convex functions is convex. Let [n] = {1, . . . , n} and G be the topology of the MRF. Theorem 8. For any k [n] \ i, let Ck be the connected component of G \ i that contains Xk. If Ck + i is a tree, then dr k dqi = Q (s t) P (i k) ξ st r sr t r s (1 r s),where P(i k) is the unique path from i to k in Ck + i, and for notational convenience, define r i = qi. Proof in Appendix (subject to P). In fact, this result applies for any combination of attractive and repulsive edges. The result is remarkable, yet also intuitive. In the numerator, ξst qsqt = Covq(Xs, Xt), increasing with Wij and equal to 0 at Wij = 0 (Weller and Jebara, 2013), and in the denominator, qs(1 qs) = Varq(Xs), hence the ratio is exactly what is called in finance the beta of Xt with respect to Xs.5 In particular, Theorem 8 shows that for any j N(i) whose component is a tree, dr j dqi = ξ ij qir j qi(1 qi) . The next result shows that in an attractive model, additional edges can only reinforce this sensitivity. Theorem 9. In an attractive model with edge (i, j), dr j (qi) dqi ξ ij qir j qi(1 qi) . Proof in Appendix (subject to P). Now collecting all terms, substituting into (4), and using (2), after some algebra yields that dv dqi 0, as required to prove Theorem 7. This now also proves Theorem 5. 4.2 The Bethe partition function lower bounds the true partition function Theorem 5, together with an argument similar to the proof of Theorem 3, easily yields a new proof that ZB Z for an attractive binary pairwise model. Theorem 10 (first proved by Ruozzi, 2012). For an attractive binary pairwise model, ZB Z. Proof. We shall use induction on n to show that the following statement holds for all n: If a MRF may be rendered acyclic by deleting n vertices v1, . . . , vn, then ZB Z. The base case n = 0 holds since the Bethe approximation is Exact On Trees. Now assume the result holds for n 1 and consider a MRF which requires n vertices to be deleted to become acyclic. Clamp variable Xn and consider Z(n) B = P1 j=0 ZB|Xn=j. By Theorem 5, ZB Z(n) B ; and by the inductive hypothesis, ZB|Xn=j Z|Xn=j j. Hence, ZB P1 j=0 ZB|Xn=j P1 j=0 Z|Xn=j = Z. 5 Experiments For an approximation which is Exact On Trees, it is natural to try clamping a few variables to remove cycles from the topology. Here we run experiments on binary pairwise models to explore the potential benefit of clamping even just one variable, though the procedure can be repeated. For exact inference, we used the junction tree algorithm. For approximate inference, we used Frank-Wolfe (FW) (Frank and Wolfe, 1956): At each iteration, a tangent hyperplane to the approximate free energy is computed at the current point, then a move is made to the best computed point along the line to the vertex of the local polytope with the optimum score on the hyperplane. This proceeds monotonically, even on a non-convex surface, hence will converge (since it is bounded), though it may be only to a local optimum and runtime is not guaranteed. This method typically produces good solutions in reasonable time compared to other approaches (Belanger et al., 2013; Weller et al., 2014) and allows direct comparison to earlier results (Meshi et al., 2009; Weller et al., 2014). To further facilitate comparison, in this Section we use the same unbiased reparameterization used by Weller et al. (2014), with E = P i V θixi P (i,j) E Wij 2 [xixj + (1 xi)(1 xj)]. 5Sudderth et al. (2007) defined a different, symmetric βst = ξst qsqt qs(1 qs)qt(1 qt) for analyzing loop series. In our context, we suggest that the ratio defined above may be a better Bethe beta. Test models were constructed as follows: For n variables, singleton potentials were drawn θi U[ Tmax, Tmax]; edge weights were drawn Wij U[0, Wmax] for attractive models, or Wij U[ Wmax, Wmax] for general models. For models with random edges, we constructed Erd os-Renyi random graphs (rejecting disconnected samples), where each edge has independent probability p of being present. To observe the effect of increasing n while maintaining approximately the same average degree, we examined n = 10, p = 0.5 and n = 50, p = 0.1. We also examined models on a complete graph topology with 10 variables for comparison with TRW in (Weller et al., 2014). 100 models were generated for each set of parameters with varying Tmax and Wmax values. Results are displayed in Figures 2 to 4 showing average absolute error of log ZB vs log Z and average ℓ1 error of singleton marginals. The legend indicates the different methods used: Original is FW on the initial model; then various methods were used to select the variable to clamp, before running FW on the 2 resulting submodels and combining those results. avg Clamp for log Z means average over all possible clampings, whereas all Clamp for marginals computes each singleton marginal as the estimated ˆpi = ZB|Xi=1/(ZB|Xi=0 + ZB|Xi=1). best Clamp uses the variable which with hindsight gave the best improvement in log Z estimate, thereby showing the best possible result for log Z. Similarly, worst Clamp picks the variable which showed worst performance. Where one variable is clamped, the respective marginals are computed thus: for the clamped variable Xi, use ˆpi as before; for all others, take the weighted average over the estimated Bethe pseudomarginals on each sub-model using weights 1 ˆpi and ˆpi for sub-models with Xi = 0 and Xi = 1 respectively. max W and Mpower are heuristics to try to pick a good variable in advance. Ideally, we would like to break heavy cycles, but searching for these is NP-hard. max W is a simple O(|E|) method which picks a variable Xi with maxi V P j N(i) |Wij|, and can be seen to perform well (Liu et al., 2012 proposed the same max W approach for inference in Gaussian models). One way in which max W can make a poor selection is to choose a variable at the centre of a large star configuration but far from any cycle. Mpower attempts to avoid this by considering the convergent series of powers of a modified W matrix, but on the examples shown, this did not perform significantly better. See 8.1 in the Appendix for more details on Mpower and further experimental results. FW provides no runtime guarantee when optimizing over a non-convex surface such as the Bethe free energy, but across all parameters, the average combined runtimes on the two clamped submodels was the same order of magnitude as that for the original model, see Figure 5. 6 Discussion The results of 4 immediately also apply to any binary pairwise model where a subset of variables may be flipped to yield an attractive model, i.e. where the topology has no frustrated cycle (Weller et al., 2014), and also to any model that may be reduced to an attractive binary pairwise model (Schlesinger and Flach, 2006; Zivny et al., 2009). For this class, together with the lower bound of 3, we have sandwiched the range of ZB (equivalently, given ZB, we have sandwiched the range of the true partition function Z) and bounded its error; further, clamping any variable, solving for optimum log ZB on sub-models and summing is guaranteed to be more accurate than solving on the original model. In some cases, it may also be faster; indeed, some algorithms such as LBP may fail on the original model but perform well on clamped sub-models. Methods presented may prove useful for analyzing general (non-attractive) models, or for other applications. As one example, it is known that the Bethe free energy is convex for a MRF whose topology has at most one cycle (Pakzad and Anantharam, 2002). In analyzing the Hessian of the Bethe free energy, we are able to leverage this to show the following result, which may be useful for optimization (proof in Appendix; this result was conjectured by N. Ruozzi). Lemma 11. In a binary pairwise MRF (attractive or repulsive edges, any topology), for any subset of variables S V whose induced topology contains at most one cycle, the Bethe free energy (using optimum pairwise marginals) over S, holding variables V \S at fixed singleton marginals, is convex. In 5, clamping appears to be very helpful, especially for attractive models with low singleton potentials where results are excellent (overcoming TRW s advantage in this context), but also for general models, particularly with the simple max W selection heuristic. We can observe some decline in benefit as n grows but this is not surprising when clamping just a single variable. Note, however, 2 4 8 12 16 0 maximum coupling strength Wmax (a) attractive log Z, Tmax = 0.1 2 4 8 12 16 0 maximum coupling strength Wmax Original all Clamp max W Clamp best Clamp worst Clamp Mpower TRW (b) attractive margs, Tmax = 0.1 2 4 8 12 16 0 maximum coupling strength Wmax Original avg Clamp max W Clamp best Clamp worst Clamp Mpower TRW (c) general log Z, Tmax = 2 2 4 8 12 16 0 maximum coupling strength Wmax (d) general margs, Tmax = 2 Figure 2: Average errors vs true, complete graph on n = 10. TRW in pink. Consistent legend throughout. 2 4 8 12 16 0 maximum coupling strength Wmax Original avg Clamp max W Clamp best Clamp worst Clamp Mpower (a) attractive log Z, Tmax = 0.1 2 4 8 12 16 0 maximum coupling strength Wmax Original all Clamp max W Clamp best Clamp worst Clamp Mpower (b) attractive margs, Tmax = 0.1 2 4 8 12 16 0 maximum coupling strength Wmax Original avg Clamp max W Clamp best Clamp worst Clamp Mpower (c) general log Z, Tmax = 2 2 4 8 12 16 0 maximum coupling strength Wmax Original all Clamp max W Clamp best Clamp worst Clamp Mpower (d) general margs, Tmax = 2 Figure 3: Average errors vs true, random graph on n = 10, p = 0.5. Consistent legend throughout. 2 4 8 12 16 0 maximum coupling strength Wmax (a) attractive log Z, Tmax = 0.1 2 4 8 12 16 0 maximum coupling strength Wmax Original all Clamp max W Clamp best Clamp worst Clamp Mpower (b) attractive margs, Tmax = 0.1 2 4 8 12 16 0 maximum coupling strength Wmax Original avg Clamp max W Clamp best Clamp worst Clamp Mpower (c) general log Z, Tmax = 2 2 4 8 12 16 0 maximum coupling strength Wmax Original all Clamp max W Clamp best Clamp worst Clamp Mpower (d) general margs, Tmax = 2 Figure 4: Average errors vs true, random graph on n = 50, p = 0.1. Consistent legend throughout. 2 4 8 12 16 1 maximum coupling strength Wmax Random n=10, Tmax=2 Random n=10, Tmax=0.1 Random n=50, Tmax=2 Random n=50, Tmax=0.1 (a) attractive random graphs 2 4 8 12 16 1 maximum coupling strength Wmax Random n=10, Tmax=2 Random n=10, Tmax=0.1 Random n=50, Tmax=2 Random n=50, Tmax=0.1 (b) general random graphs (c) Blue (dashed red) edges are attractive (repulsive) with edge weight +2 ( 2). No singleton potentials. Figure 5: Left: Average ratio of combined sub-model runtimes to original runtime (using max W, other choices are similar). Right: Example model where clamping any variable worsens the Bethe approximation to log Z. that non-attractive models exist such that clamping and summing over any variable can lead to a worse Bethe approximation of log Z, see Figure 5c for a simple example on four variables. It will be interesting to explore the extent to which our results may be generalized beyond binary pairwise models. Further, it is tempting to speculate that similar results may be found for other approximations. For example, some methods that upper bound the partition function, such as TRW, might always yield a lower (hence better) approximation when a variable is clamped. Acknowledgments. We thank Nicholas Ruozzi for careful reading, and Nicholas, David Sontag, Aryeh Kontorovich and Toma z Slivnik for helpful discussion and comments. This work was supported in part by NSF grants IIS-1117631 and CCF-1302269. V. Bafna, P. Berman, and T. Fujito. A 2-approximation algorithm for the undirected feedback vertex set problem. SIAM Journal on Discrete Mathematics, 12(3):289 9, 1999. D. Belanger, D. Sheldon, and A. Mc Callum. Marginal inference in MRFs using Frank-Wolfe. In NIPS Workshop on Greedy Optimization, Frank-Wolfe and Friends, December 2013. D. Bertsekas. Nonlinear Programming. Athena Scientific, 1995. A. Choi and A. Darwiche. Approximating the partition function by deleting and then correcting for model edges. In Uncertainty in Artificial Intelligence (UAI), 2008. A. Darwiche. Modeling and Reasoning with Bayesian Networks. Cambridge University Press, 2009. F. Eaton and Z. Ghahramani. Choosing a variable to clamp: Approximate inference using conditioned belief propagation. In Artificial Intelligence and Statistics, 2009. K. Fan. Topological proofs for certain theorems on matrices with non-negative elements. Monatshefte fr Mathematik, 62:219 237, 1958. M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2): 95 110, 1956. ISSN 1931-9193. doi: 10.1002/nav.3800030109. R. Karp. Complexity of Computer Computations, chapter Reducibility Among Combinatorial Problems, pages 85 103. New York: Plenum., 1972. D. Koller and N. Friedman. Probabilistic Graphical Models - Principles and Techniques. MIT Press, 2009. S. Lauritzen and D. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society series B, 50:157 224, 1988. Y. Liu, V. Chandrasekaran, A. Anandkumar, and A. Willsky. Feedback message passing for inference in Gaussian graphical models. IEEE Transactions on Signal Processing, 60(8):4135 4150, 2012. R. Mc Eliece, D. Mac Kay, and J. Cheng. Turbo decoding as an instance of Pearl s Belief Propagation algorithm. IEEE Journal on Selected Areas in Communications, 16(2):140 152, 1998. O. Meshi, A. Jaimovich, A. Globerson, and N. Friedman. Convexifying the Bethe free energy. In UAI, 2009. P. Milgrom. The envelope theorems. Department of Economics, Standford University, Mimeo, 1999. URL http://www-siepr.stanford.edu/workp/swp99016.pdf. J. Mitchell. Branch-and-cut algorithms for combinatorial optimization problems. Handbook of Applied Optimization, pages 65 77, 2002. K. Murphy, Y. Weiss, and M. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Uncertainty in Artificial Intelligence (UAI), 1999. M. Padberg and G. Rinaldi. A branch-and-cut algorithm for the resolution of large-scale symmetric traveling salesman problems. SIAM review, 33(1):60 100, 1991. P. Pakzad and V. Anantharam. Belief propagation and statistical physics. In Princeton University, 2002. J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988. M. Peot and R. Shachter. Fusion and propagation with multiple observations in belief networks. Artificial Intelligence, 48(3):299 318, 1991. I. Rish and R. Dechter. Resolution versus search: Two strategies for SAT. Journal of Automated Reasoning, 24 (1-2):225 275, 2000. N. Ruozzi. The Bethe partition function of log-supermodular graphical models. In Neural Information Processing Systems, 2012. D. Schlesinger and B. Flach. Transforming an arbitrary minsum problem into a binary one. Technical report, Dresden University of Technology, 2006. E. Sudderth, M. Wainwright, and A. Willsky. Loop series and Bethe variational bounds in attractive graphical models. In NIPS, 2007. D. Topkis. Minimizing a submodular function on a lattice. Operations Research, 26(2):305 321, 1978. P. Vontobel. Counting in graph covers: A combinatorial characterization of the Bethe entropy function. Information Theory, IEEE Transactions on, 59(9):6018 6048, Sept 2013. ISSN 0018-9448. M. Wainwright and M. Jordan. Graphical models, exponential families and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1 305, 2008. M. Wainwright, T. Jaakkola, and A. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7):2313 2335, 2005. A. Weller and T. Jebara. Bethe bounds and approximating the global optimum. In AISTATS, 2013. A. Weller and T. Jebara. Approximating the Bethe partition function. In UAI, 2014. A. Weller, K. Tang, D. Sontag, and T. Jebara. Understanding the Bethe approximation: When and how can it go wrong? In Uncertainty in Artificial Intelligence (UAI), 2014. M. Welling and Y. Teh. Belief optimization for binary networks: A stable alternative to loopy belief propagation. In Uncertainty in Artificial Intelligence (UAI), 2001. S. Zivny, D. Cohen, and P. Jeavons. The expressive power of binary submodular functions. Discrete Applied Mathematics, 157(15):3347 3358, 2009.