# Uprooting and Rerooting Graphical Models

Adrian Weller (ADRIAN.WELLER@ENG.CAM.AC.UK)
Department of Engineering, University of Cambridge, United Kingdom

## Abstract

We show how any binary pairwise model may be uprooted to a fully symmetric model, wherein original singleton potentials are transformed to potentials on edges to an added variable, and then rerooted to a new model on the original number of variables. The new model is essentially equivalent to the original model, with the same partition function and allowing recovery of the original marginals or a MAP configuration, yet may have very different computational properties that allow much more efficient inference. This meta-approach deepens our understanding, may be applied to any existing algorithm to yield improved methods in practice, generalizes earlier theoretical results, and reveals a remarkable interpretation of the triplet-consistent polytope.

## 1. Introduction

Undirected graphical models, also called Markov random fields (MRFs), have become a central tool in machine learning, providing a powerful and compact way to model relationships between variables. However, many key problems, such as identifying a configuration with highest probability (termed maximum a posteriori or MAP inference), estimating marginal probabilities of subsets of variables (marginal inference) or calculating the normalizing partition function, are typically computationally intractable, leading to much work to identify settings where exact polynomial-time methods apply, or to create approximate algorithms that perform well.

Focusing on binary pairwise models (see §2 for definitions and details), we provide a general meta-method for inference that generalizes and strengthens existing theoretical results, deepens our understanding, and can help significantly in practice.

Suppose a model M has n variables X1, . . . , Xn with various singleton and edge potentials. We start by uprooting this model to a uniquely determined parent model M+ where all previous singleton potentials are converted to edge potentials to a newly added variable X0. This M+ model is elegantly symmetric: no singleton potentials, and all edge potentials give a score only if incident variables are different. This uprooting is not a novel idea for MAP inference (Barahona et al., 1988; Sontag, 2007) but we believe the other ideas presented here are new.

The uprooted M+ model is interesting in itself; for example, its partition function is exactly twice that of the original model M, which we may consider as the parent M+ model rooted at X0. A key idea is that we can reroot M+ at any other variable, for example X2, to yield an equivalent model on n variables X0, X1, X3, . . . , Xn, which has new singleton potentials determined by the edge potentials to X2 in M+. In effect, this is a different view or crystallization of the parent symmetric M+ model. Call this X2-rooted model M2 (we could root at any variable Xi to obtain Mi; note that the original model M is M0). We make the following observations. M2 indeed represents essentially the same model as M. It lies in the equivalence class of models that map to the same unique symmetric representation M+. M2 has the same partition function as M but may have very different computational properties. The original model M might be hard but M2 could be easy.
Using any existing inference method for M2, it is not hard also to recover all the original singleton marginals or a MAP configuration of M, see §4.1. Hence we have a general meta-method for inference: given any inference approach, instead of applying it to M, we can consider various equivalent rerooted models and apply the approach to one of them, which may work much better. Many existing methods and bounds apply only to particular ranges of edge and singleton potentials, which are changed after rerooting. Hence, we can generalize existing approaches.

We discuss various implications in §5. For example, we can use the very efficient max flow/min cut method for MAP inference in a model if all edges are attractive, with no conditions on singleton potentials (more generally, if the model is balanced, see §2.1). This might not be possible in the original model M but will be possible in some rerooted model iff there exists some variable Xi in M+ such that, after rooting at Xi, the remainder Mi is balanced. This is equivalent to the condition that M+ is almost balanced, which can be tested efficiently.

Understanding that singleton and pairwise potentials appear different only due to a particular choice of root in M+ provides an important fresh perspective, leading to a re-evaluation of existing methods, and a remarkable interpretation of the triplet-consistent polytope, see §5.

Binary pairwise models play an important role in many fields such as computer vision (Blake et al., 2011). Further, any discrete graphical model may essentially be converted into an equivalent binary pairwise model, though this may require a large increase in the number of variables.[^1]

[^1]: Eaton and Ghahramani (2013) show that this is strictly true if all model states have probability strictly > 0; otherwise an arbitrarily good approximation is possible.

**Contributions.** After providing background in §2, we explain the details of the uprooting and rerooting approach in §3–§4, showing how inference on a rerooted model allows recovery of information about the original model. This includes a discussion in §4.2 of the relation to clamping, where we introduce a new clamping heuristic that performs particularly well in settings that are likely to arise for rerooting. In §5, we discuss implications of rerooting, showing how it generalizes theoretical results, may be used as a meta-algorithm for inference methods, and provides a fascinating perspective on the triplet-consistent polytope. We provide an empirical evaluation in §6, showing that rerooting can be particularly effective for models with dense, strong edges and weak singleton potentials.

**Related Work.** What we call uprooting has been described previously as a way to reduce MAP inference of M to the MAXCUT problem in M+ (Barahona et al., 1988). As we discuss in §4.2, uprooting to M+ may be viewed as a de-clamping of the model at X0, while a rerooting may be considered a re-clamping at a different variable. Hence, rerooting replaces an initial implicit clamp choice at X0 with another. The choice of root is essentially the question of which variable in M+ to clamp. Methods to select a variable to clamp have been explored by Eaton and Ghahramani (2009) and Weller and Domke (2016).

## 2. Preliminaries

We focus on undirected graphical models with n binary variables X1, . . . , Xn ∈ {0, 1}. We consider a probability distribution p(x) = e^{θ(x)}/Z(θ), where x = (x1, . . . , xn) is one particular configuration of all variables and θ(x) is the score of configuration x, which decomposes into singleton and pairwise (log) potential terms.
The topology of the model is the graph (V, E), with V = {1, . . . , n}, where i corresponds to Xi, and E ⊆ V × V contains an edge for each pairwise score relationship. We assume a reparameterization to the minimal representation (Wainwright and Jordan, 2008) wherein the score may be written

$$\theta(x) = \sum_{i \in V} \theta_i x_i \;-\; \frac{1}{2} \sum_{(i,j) \in E} W_{ij}\, \mathbb{1}[x_i \neq x_j], \qquad (1)$$

where 1[·] is the indicator function. Z(θ) is a normalizing constant, called the partition function, which ensures that probabilities sum to 1, i.e. Z(θ) = Σ_{x ∈ {0,1}^n} e^{θ(x)}. Note that (1) gives a score to an edge only if its incident variables are different. The factor of 1/2 before the edge potentials means that the signs and scaling of our parameters are consistent with earlier work such as (Welling and Teh, 2001; Weller and Domke, 2016). If Wij > 0 then the edge (i, j) is attractive and tends to pull its incident variables towards the same value; if Wij < 0 then the edge is repulsive and tends to push apart the values of its variables.

### 2.1. Attractive, Mixed and Balanced Models

If all edges of a model are attractive, i.e. if Wij ≥ 0 ∀(i, j) ∈ E, then we say that the model is attractive, else it is mixed. Sometimes a mixed model may be converted to an equivalent attractive model by flipping (also called switching) a subset of variables S, which reverses the signs of their singleton potentials and of the edge potentials between variables in S and variables in V \ S; if this is possible, the mixed model is called balanced. Harary (1953) showed that a model is balanced iff it does not contain a frustrated cycle, which is a cycle containing an odd number of repulsive edges. This may be checked and, if the model is balanced, a flipping set S found, in time linear in |E| (Harary and Kabell, 1980). Hence, results for attractive models readily extend to the larger class of balanced models.

**Notation.** For any a ∈ {0, 1}, let ā = 1 − a (this flips 0 ↔ 1). Similarly, for a vector x = (x1, . . . , xn), let x̄ = (x̄1, . . . , x̄n) = (1 − x1, . . . , 1 − xn). For a configuration y = (y0, y1, . . . , yn) of M+ and a ∈ {0, 1}, we may write y = (a, x) to mean y = (a, x1, . . . , xn).
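The Harary test of §2.1 amounts to 2-colouring each connected component of the signed graph, insisting that attractive edges join like colours and repulsive edges join opposite colours. Below is a minimal sketch of this idea (ours, not the paper's code; the names `balance_test` and the returned flip set are our own):

```python
from collections import deque

def balance_test(nodes, edges):
    """Return (True, flip_set) if the signed graph is balanced, where flipping
    the variables in flip_set makes every edge attractive; else (False, None).
    `edges` is a dict {(i, j): W_ij}."""
    adj = {v: [] for v in nodes}
    for (i, j), w in edges.items():
        if w != 0:
            adj[i].append((j, w))
            adj[j].append((i, w))
    side = {}
    for start in adj:                      # BFS 2-colouring per component
        if start in side:
            continue
        side[start] = 0
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j, w in adj[i]:
                want = side[i] if w > 0 else 1 - side[i]   # same side iff attractive
                if j not in side:
                    side[j] = want
                    queue.append(j)
                elif side[j] != want:      # odd number of repulsive edges on a cycle
                    return False, None
    return True, {v for v, s in side.items() if s == 1}
```

Wrapping this in a loop that deletes each variable in turn and re-tests gives one simple way to check whether a model is almost balanced, which is how the efficient test mentioned above and used in §5.1 can be realized.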
## 3. Uprooting a Model

We show how any model M on n variables X1, . . . , Xn with singleton potentials may be uprooted to a unique symmetric (i.e. no singleton potentials) model M+ on n + 1 variables X0, X1, . . . , Xn. Edges to the extra variable X0 encode the original singleton potentials.[^2]

[^2]: If the original model M has no singleton potentials, then it may be regarded as already in M+ form. It may still be rooted at any variable, as described in §4.

| x0 x1 x2 x3 | e01 | e02 | e03 | e12 | e13 | e23 | in M2 (X2 = 0) |
|---|---|---|---|---|---|---|---|
| 0 0 0 0 |   |   |   |   |   |   | ✓ |
| 0 0 0 1 |   |   | ✓ |   | ✓ | ✓ | ✓ * |
| 0 0 1 0 |   | ✓ |   | ✓ |   | ✓ |   |
| 0 0 1 1 |   | ✓ | ✓ | ✓ | ✓ |   |   |
| 0 1 0 0 | ✓ |   |   | ✓ | ✓ |   | ✓ |
| 0 1 0 1 | ✓ |   | ✓ | ✓ |   | ✓ | ✓ * |
| 0 1 1 0 | ✓ | ✓ |   |   | ✓ | ✓ |   |
| 0 1 1 1 | ✓ | ✓ | ✓ |   |   |   |   |
| 1 0 0 0 | ✓ | ✓ | ✓ |   |   |   | ✓ * |
| 1 0 0 1 | ✓ | ✓ |   |   | ✓ | ✓ | ✓ |
| 1 0 1 0 | ✓ |   | ✓ | ✓ |   | ✓ |   |
| 1 0 1 1 | ✓ |   |   | ✓ | ✓ |   |   |
| 1 1 0 0 |   | ✓ | ✓ | ✓ | ✓ |   | ✓ * |
| 1 1 0 1 |   | ✓ |   | ✓ |   | ✓ | ✓ |
| 1 1 1 0 |   |   | ✓ |   | ✓ | ✓ |   |
| 1 1 1 1 |   |   |   |   |   |   |   |

Table 1. An example showing all configurations of an uprooted M+ model on 4 variables; an edge e_ij scores iff its ends are different. The original model M has 3 variables X1, X2, X3; then X0 was added. Each configuration of M features twice: once as (0, x1, x2, x3) in the top half, and then again with all settings flipped as (1, x̄1, x̄2, x̄3) in the bottom. Each has the same score in M+, with the score determined only by the edges which are activated: see the check marks and note their reflective symmetry across the horizontal midline of the table. The final column marks the configurations for the rerooted model M2, where X2 = 0 (shaded pink in the original figure). Observe that, given the symmetry, these correspond 1-1 with the configurations of M. Hence, we can recover the partition function, marginal probabilities or a MAP configuration for M by inference on M2. For example, p0(X3 = 1) for M may be computed as p2(X3 ≠ X0) for M2, i.e. the sum over the rows marked *. Each row in the top half with x3 = 1 which is missing from M2 has an exactly corresponding row in the bottom half. See §3 and §4 for details.

Let y = (y0, y1, . . . , yn) be a configuration in M+ of its n + 1 variables, and let φ(y) be its score in M+. Requiring φ(y) to be of the same form as (1) but with no singleton potentials, and to match the scores of configurations in M when x0 = 0, i.e. requiring φ(0, x) = θ(x), implies

$$\phi(y) = -\frac{1}{2} \sum_{(i,j) \in E^+} W_{ij}\, \mathbb{1}[y_i \neq y_j], \qquad (2)$$

where the edges of M+ are E+ = E ∪ F, consisting of the original edges E of M together with new edges F which are added to the new variable X0, given by F = {(0, i) : θi ≠ 0}. Weights for edges in E remain unchanged. Weights for the new edges in F are set as W0i = −2θi. To see this, note that each singleton term in (1) may be written θi xi = −(1/2)(−2θi) 1[xi ≠ x0] when x0 = 0. Note the sign flip when a singleton potential is converted to an edge potential: e.g. θi > 0 becomes a repulsive edge in M+ with W0i < 0. This is an unavoidable consequence of choosing parameters in (1) to match earlier work. It may be helpful to think of θi > 0 as encouraging Xi to be different from 0, i.e. a repulsive edge from X0 = 0.

Observe that each configuration x of M maps to two configurations y0 and y1 in M+:

$$M: x = (x_1, \ldots, x_n) \;\longmapsto\; M^+: \begin{cases} y^0 = (0, x) \\ y^1 = \bar{y}^0 = (1, \bar{x}), \end{cases} \qquad (3)$$

i.e. y0 = (0, x1, . . . , xn) and y1 = ȳ0 = (1, x̄1, . . . , x̄n). Given the symmetry of (2), it is clear that φ(y0) = φ(y1). See Table 1 and Figure 1 for an example. The partition function of M+ is clearly twice that of M, i.e. Z(φ) = 2Z(θ). The original model M is exactly M+ conditioned on X0 = 0, and we can write M = M0.
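The construction is mechanical. A minimal sketch (ours; `uproot` is our own name, with θ given as a dict over variables 1..n):

```python
def uproot(theta, edges):
    """Uprooting per Section 3: theta is {i: theta_i} and edges is {(i, j): W_ij}
    on variables 1..n. Returns the edge dict of the symmetric model M+ on 0..n,
    where each singleton becomes an edge W_{0i} = -2*theta_i (note the sign flip)."""
    uprooted = dict(edges)                    # weights in E are unchanged
    for i, t in theta.items():
        if t != 0.0:
            uprooted[(0, i)] = -2.0 * t       # F = {(0, i) : theta_i != 0}
    return uprooted
```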
## 4. Rerooting a Model

The symmetric model M+ with form (2) may be rooted at any variable Xi by conditioning on Xi = 0, to yield a model on n variables {X0, . . . , Xi−1, Xi+1, . . . , Xn}, which we write as Mi.[^3] See Table 1 and Figure 1 for an example.

[^3]: Given the symmetry, one could instead equivalently consider M+ conditioned on Xi = 1, but then one would need to flip variables after performing inference in order to match the original interpretation in M.

Figure 1. Examples of (a) an original model M = M0 on three variables x1, x2, x3 with edge and singleton potentials (θ1, θ3 < 0; θ2 > 0), forming a frustrated cycle; (b) its unique uprooted model M+ on n + 1 = 4 variables x0, x1, x2, x3 with only edge and no singleton potentials, which is almost balanced; and (c) a different rooting of M+ at x2, yielding M2 on three variables x0, x1, x3 with new singleton potentials θ′0, θ′1 > 0 and θ′3 < 0, where now all edges are attractive. Solid blue (dashed red) edges are attractive (repulsive).

Considering (3), for any i there is a score-preserving 1-1 correspondence between configurations in M and those in Mi, which matches x in M with whichever of y0 or y1 has xi = 0 (the xi coordinate is then removed to give the configuration in Mi). Equivalently, if x in M has xi = 0, then it matches (0, x1, . . . , xi−1, xi+1, . . . , xn) in Mi; otherwise it matches the same configuration with all settings flipped, i.e. (1, x̄1, . . . , x̄i−1, x̄i+1, . . . , x̄n).

### 4.1. Recovery of Original MAP Configuration, Partition Function or Marginals

In this section, we show that if inference can be performed on a rerooted model Mi, then we can recover results for the original model M.

#### 4.1.1. MAP Inference

For MAP inference, given the score-preserving 1-1 correspondence of configurations noted above in §4, if a MAP configuration is determined for Mi, then the corresponding configuration in M is a MAP configuration for M with the same score. Specifically, we have the following result.

**Lemma 1.** If (x*0, . . . , x*i−1, x*i+1, . . . , x*n) is a MAP configuration for Mi, then the corresponding MAP configuration for M, with the same score, is

$$x^* = \begin{cases} (x^*_1, \ldots, x^*_{i-1},\; x_i = 0,\; x^*_{i+1}, \ldots, x^*_n) & \text{if } x^*_0 = 0, \\ (\bar{x}^*_1, \ldots, \bar{x}^*_{i-1},\; x_i = 1,\; \bar{x}^*_{i+1}, \ldots, \bar{x}^*_n) & \text{if } x^*_0 = 1. \end{cases}$$

#### 4.1.2. Marginal Inference and Computing Z

Since Mi and M have corresponding configurations with equal scores, they have the same partition function. In order to differentiate between probabilities obtained for different models, we use the following notation: let pi be the probability distribution in model Mi; in particular, p0 is the distribution for model M0, which is the original model M. Let p+ be the distribution in the uprooted model M+. Each model Mi is the result of conditioning on Xi = 0 in M+.

We would like to perform (exact or approximate) inference on Mi to obtain pi, then use this to recover the marginals p0(Xj = 1) ∀j ∈ {1, . . . , n}. The following result achieves this. See Table 1 for an example.

**Lemma 2.** Given the distribution pi for any i ∈ {1, . . . , n}, the marginals p0(Xj = 1) for the original model M = M0 may be recovered as follows:

$$p_0(X_j = 1) = \begin{cases} p_i(X_0 = 1) & j = i, \\ p_i(X_j \neq X_0) & j \neq i. \end{cases}$$

*Proof.* This follows from the symmetry of M+; see the Appendix for details.
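These recovery rules are easy to verify by brute force on a tiny model. The following self-contained sketch (ours; exponential-time enumeration, for intuition only) checks that a rerooted model has the same partition function as M and that the identities of Lemma 2 hold:

```python
import itertools, math, random

def score(theta, edges, x):
    """theta(x) = sum_i theta_i x_i - 0.5 * sum_edges W_ij 1[x_i != x_j]  (eq. 1)."""
    s = sum(theta[i] * x[i] for i in theta)
    return s - 0.5 * sum(w for (i, j), w in edges.items() if x[i] != x[j])

random.seed(0)
n = 4
theta = {i: random.uniform(-1, 1) for i in range(1, n + 1)}
edges = {(i, j): random.uniform(-2, 2)
         for i in range(1, n + 1) for j in range(i + 1, n + 1)}

# Uproot (W_{0i} = -2 theta_i), then reroot at variable r by clamping X_r = 0.
r = 2
up = dict(edges)
up.update({(0, i): -2.0 * t for i, t in theta.items()})
vars_r = [v for v in range(n + 1) if v != r]            # variables of M_r

def phi(y):                                             # eq. (2); y: var -> value
    return -0.5 * sum(w for (i, j), w in up.items() if y[i] != y[j])

pm, pr = {}, {}                                         # unnormalised masses
for x in itertools.product((0, 1), repeat=n):
    pm[x] = math.exp(score(theta, edges, dict(enumerate(x, start=1))))
for x in itertools.product((0, 1), repeat=n):
    y = dict(zip(vars_r, x)); y[r] = 0                  # clamp X_r = 0 in M+
    pr[x] = math.exp(phi(y))
Zm, Zr = sum(pm.values()), sum(pr.values())
assert abs(Zm - Zr) < 1e-9 * Zm                         # same partition function

for j in range(1, n + 1):                               # Lemma 2
    target = sum(p for x, p in pm.items() if x[j - 1] == 1) / Zm
    if j == r:    # p_0(X_r = 1) = p_r(X_0 = 1)
        rec = sum(p for x, p in pr.items() if dict(zip(vars_r, x))[0] == 1) / Zr
    else:         # p_0(X_j = 1) = p_r(X_j != X_0)
        rec = sum(p for x, p in pr.items()
                  if dict(zip(vars_r, x))[j] != dict(zip(vars_r, x))[0]) / Zr
    assert abs(target - rec) < 1e-9
```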
### 4.2. Relation to Clamping; How to Choose a Root?

Conditioning a model on a variable taking a particular value is sometimes called clamping (Eaton and Ghahramani, 2009; Weller and Jebara, 2014). Since Mi is M+ conditioned on Xi = 0, we may view uprooting from M = M0 to M+ as de-clamping X0 back to a parent model; rerooting at variable Xi is then a re-clamping at Xi = 0 to obtain Mi.

In typical clamping for inference, one must condition a variable to each of its values and combine all results (for example, when estimating Z, one must sum the approximate sub-partition functions). For binary variables, this requires running the inference algorithm twice. In contrast, a rooted model gets a clamping for free at the root variable, with just one inference run required.

A natural question is how to choose a good root when rerooting a model. Given the interpretation of rooting as clamping, we can draw on earlier work. Weller and Domke (2016) examined a range of heuristics and concluded that a fast method called maxW typically performs very well for approximate inference.[^4] The idea behind maxW is that many popular methods of approximate inference, such as belief propagation, are exact on acyclic models but can perform poorly when there are cycles composed of strong edge weights. It is NP-hard to identify heavy cycles, but the following simple heuristic was shown to be empirically effective. For each variable, a sum of absolute values of incident edge weights is computed, then the variable with the highest sum is selected to clamp. When clamped, this variable is effectively removed from the model, thereby eliminating any cycles which ran through it.

[^4]: Weller and Domke (2016) showed that a more sophisticated variant called maxW+core+TRE performed slightly better in general, but TRE is redundant for the fully symmetric M+ model, and the core idea makes no difference in the experiments we run.

#### 4.2.1. A New Method: maxtW

In §6, we explore the value of rerooting using the earlier maxW heuristic to select the root variable. We observe that maxW sometimes performs well, but one setting where it can perform poorly is when a choice must be made between one variable that has a few strong edges and another that has many weak edges. When rerooting, this may happen frequently. For example, consider an initial model M with a grid topology and singleton potentials that are low relative to edge potentials: in M+ this leads to X0 having a weak edge to every other variable, whereas the other variables have few strong edges. maxW simply picks the variable i with highest Σ_{j∈N(i)} |Wij|, where N(i) is the set of variables adjacent to i. However, the direct influence of a strong edge weight does not keep increasing linearly with its weight; rather, it reaches a hard saturation level (Weller and Jebara, 2014, Supplement, Lemma 12). Here we address this by introducing an alternative heuristic we call maxtW, which picks the variable i with maximum Σ_{j∈N(i)} tanh|Wij/4|. This form was chosen based on earlier work on loop series expansions (Weller et al., 2014; Sudderth et al., 2007). See results in §6, where maxtW can dramatically outperform maxW on grids.
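Both heuristics reduce to a single pass over the uprooted edge set. A sketch (ours; `choose_root` is our own name, and the usage line reuses the `uproot` sketch from §3):

```python
import math
from collections import defaultdict

def choose_root(up_edges, saturate=True):
    """Root selection on M+: maxW scores variable i by sum_j |W_ij|, while
    maxtW (saturate=True) replaces each |W_ij| with tanh|W_ij/4| to model the
    saturating influence of strong edges. `up_edges` is {(i, j): W_ij}."""
    strength = defaultdict(float)
    for (i, j), w in up_edges.items():
        s = math.tanh(abs(w) / 4.0) if saturate else abs(w)
        strength[i] += s
        strength[j] += s
    return max(strength, key=strength.get)

# root = choose_root(uproot(theta, edges))        # maxtW
# root = choose_root(uproot(theta, edges), False) # maxW
```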
## 5. Implications of Rerooting

Rerooting provides a conceptual framework in which singleton and edge potentials are essentially equivalent, differing only by a choice of rooting of the symmetric uprooted parent model M+. After rerooting, it may be possible to apply many methods or bounds that were unavailable for the original model M. We consider important examples below.

### 5.1. MAP Inference

The success of many existing methods of MAP inference depends critically on the nature of the edge potentials of a model, but can be relatively insensitive to the singleton potentials. For example, both the max flow/min cut method (Greig et al., 1989) and the basic linear programming (LP) relaxation over the local polytope LOC (Wainwright and Jordan, 2008) provide an exact solution in polynomial time if the model is attractive. These approaches generalize to balanced models, see §2.1. With rerooting, these methods can now be used on the significantly larger class of models for which some rooting Mi is balanced. This holds iff the uprooted model M+ is almost balanced, meaning that it contains a variable whose removal renders the remaining model balanced. See Figure 1 for an example.

Almost balanced models have received recent attention. Jebara (2009) introduced a method for MAP inference via a reduction to the graph-theoretic challenge of identifying a maximum weight stable set (MWSS) in a derived weighted graph, which, if the graph is perfect, allows an exact solution to be obtained efficiently. Weller (2015b) showed that this method applies iff the block decomposition of the model M yields blocks (maximal 2-connected components) which are all almost balanced. With rerooting, we can extend this method to models M whose uprooted model M+ is 2-almost balanced, i.e. can be rendered balanced by deleting 2 variables (since by rooting at either of these variables, the rooted model is almost balanced).

Figure 2. An example of a model M which is not almost balanced, hence does not satisfy the conditions of Weller et al. (2016) for tightness of LP+TRI. Nevertheless, by Theorem 4, it is sufficient if its uprooted M+ is 2-almost balanced. Hence LP+TRI will always be tight for (any rerooting of) this model provided the singleton potentials are not of the form: θ1, θ2 take one sign (either positive or negative) and θ3, θ4 take the other. Solid blue (dashed red) edges are attractive (repulsive). See §5.2 for details.

### 5.2. Local and Triplet Polytopes; Why "Rooting"?

Weller et al. (2016) showed that the LP relaxation on the triplet-consistent polytope, LP+TRI, yields an exact MAP configuration of a model M provided it is almost balanced (for any singleton potentials). As above, this can now be generalized to any model whose uprooted model M+ is 2-almost balanced, since then a rooting exists which is almost balanced. In fact, we can achieve a much stronger result due to the following remarkable property of TRI.

**Theorem 3** (TRI is universally rooted). LP+TRI yields the same optimum score for M as for any rerooting Mi; hence LP+TRI is either tight for all rerootings or for none.

Theorem 3 immediately yields the following new result.

**Theorem 4.** LP+TRI is tight for (any rerooting of) a model M whose uprooted model M+ is 2-almost balanced.

See the Appendix §8 for details and proofs. This beautifully shows the common nature of edge and singleton potentials for TRI, examining the signs of all edges in M+ in the same symmetric way. Theorem 4 helps us to understand tightness of LP+TRI on real-world vision tasks, where learned models are close to attractive due to contiguity of objects. As a small example, Theorem 4 shows that LP+TRI is tight for the model shown in Figure 2, despite it not being almost balanced, provided the signs of singleton potentials leave M+ 2-almost balanced (this holds for all values unless θ1, θ2 take one sign, positive or negative, and θ3, θ4 take the other).

Here we sketch the reasoning. The following polytopes are equivalent (see Deza and Laurent, 1997):

| n variables + C(n,2) edges | C(n+1,2) edges |
|---|---|
| Marginal polytope of M | Cut polytope of M+ |
| TRI relaxation | MET relaxation |
| LOC relaxation | RMET relaxation |

MET, the semimetric polytope relaxation of the cut polytope, enforces triplet constraints on every triplet of {X0, . . . , Xn}. In contrast, RMET, the rooted semimetric polytope, enforces these same triplet constraints only on triplets that include the root variable X0. This explains the name "rooting". LOC is equivalent to a specifically rooted RMET polytope, which is why approaches over LOC (including many message-passing algorithms) deal differently with singleton and edge potentials, and might benefit significantly from rerooting. TRI, however, is equivalent to MET, which deals symmetrically with all variables in M+ and corresponds to a simultaneous rooting at every variable. This intriguing observation likely has further theoretical and algorithmic consequences, which we aim to explore in future work.
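To make the LOC ↔ RMET correspondence tangible, here is a small sketch (ours, not the paper's code; `lp_loc` and the variable layout are our own) of the LP relaxation over LOC in the disagreement parameterization of (1): each per-edge constraint block, |μi − μj| ≤ dij ≤ min(μi + μj, 2 − μi − μj), is exactly the MET triangle inequality set for the triplet (X0, Xi, Xj) with μi playing the role of the cut value d0i, which is the "rooted" view:

```python
import numpy as np
from scipy.optimize import linprog

def lp_loc(theta, edges, n):
    """Upper bound on max_x theta(x) for eq. (1), via LP over LOC.
    Decision variables: mu_1..mu_n, then one d_e per edge relaxing 1[x_i != x_j]."""
    e_index = {e: n + k for k, e in enumerate(edges)}
    nv = n + len(edges)
    c = np.zeros(nv)
    for i, t in theta.items():
        c[i - 1] = -t                       # linprog minimises, so negate
    rows, rhs = [], []
    for (i, j), w in edges.items():
        c[e_index[(i, j)]] = 0.5 * w        # objective: -(theta.mu - 0.5 W.d)
        for sgn in (1.0, -1.0):             # d_ij >= |mu_i - mu_j|
            r = np.zeros(nv)
            r[i - 1], r[j - 1], r[e_index[(i, j)]] = sgn, -sgn, -1.0
            rows.append(r); rhs.append(0.0)
        r = np.zeros(nv)                    # d_ij <= mu_i + mu_j
        r[i - 1] = r[j - 1] = -1.0
        r[e_index[(i, j)]] = 1.0
        rows.append(r); rhs.append(0.0)
        r = np.zeros(nv)                    # d_ij <= 2 - mu_i - mu_j
        r[i - 1] = r[j - 1] = 1.0
        r[e_index[(i, j)]] = 1.0
        rows.append(r); rhs.append(2.0)
    res = linprog(c, A_ub=np.vstack(rows), b_ub=np.array(rhs),
                  bounds=[(0.0, 1.0)] * nv)
    return -res.fun                          # maximum of the relaxed objective
```

Consistent with §5.1, on a balanced model the returned bound is attained by an integral MAP configuration.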
### 5.3. Belief Propagation

Belief propagation (BP; Pearl, 1988), or more generally the Bethe approximation (Yedidia et al., 2000), is a widely used approach for approximate inference, guaranteed to yield exact results in linear time for models without cycles. When applied to models with cycles, it often yields strikingly accurate results but may fail to converge altogether. Much work has analyzed the convergence of BP and the uniqueness of its fixed points, relying either on the strength of edge interactions (Heskes, 2004; Mooij and Kappen, 2005) or just on their signs (Watanabe, 2011). Mooij and Kappen (2007) refined their earlier result by also considering singleton potentials, but these are incorporated quite differently from the edge potentials. Hence, by rerooting, it may be possible to provide theoretical guarantees on performance that are not available for the initial model. As one example, it is known that if a model has at most one cycle, then the Bethe free energy is convex and BP has a unique fixed point (Pakzad and Anantharam, 2002).

Consider the model shown in Figure 1. The original model (a) is a frustrated cycle, hence the BP estimate of Z will be too high, with unboundedly high error as edge weights increase (Weller, 2015a, §6.3). In contrast, the rerooted model (c) is attractive, hence the BP estimate is always in the range [Z/2, Z] for any potentials (Weller and Jebara, 2014).
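For experimenting with this contrast, a compact damped sum-product sketch for models of form (1) (ours; the paper's experiments use libDAI instead, and the damping factor and iteration count here are arbitrary choices):

```python
import math
from collections import defaultdict

def loopy_bp(theta, edges, n, iters=200, damp=0.5):
    """Loopy sum-product BP. Returns approximate marginals p(X_i = 1), i = 1..n.
    Pairwise factor: exp(-0.5 * W_ij * 1[x_i != x_j]); singleton: exp(theta_i x_i)."""
    nbrs = defaultdict(dict)
    for (i, j), w in edges.items():
        nbrs[i][j] = w
        nbrs[j][i] = w
    # msg[(i, j)][b]: normalised message from i to j, evaluated at x_j = b.
    msg = {(i, j): [0.5, 0.5] for i in nbrs for j in nbrs[i]}
    for _ in range(iters):
        for (i, j), old in list(msg.items()):
            new = []
            for b in (0, 1):
                total = 0.0
                for a in (0, 1):
                    val = math.exp(theta.get(i, 0.0) * a
                                   - 0.5 * nbrs[i][j] * (a != b))
                    for k in nbrs[i]:
                        if k != j:
                            val *= msg[(k, i)][a]
                    total += val
                new.append(total)
            s = new[0] + new[1]
            msg[(i, j)] = [damp * o + (1 - damp) * v / s
                           for o, v in zip(old, new)]
    marg = {}
    for i in range(1, n + 1):
        b = [math.exp(theta.get(i, 0.0) * a)
             * math.prod(msg[(k, i)][a] for k in nbrs[i]) for a in (0, 1)]
        marg[i] = b[1] / (b[0] + b[1])
    return marg
```

Running this on a frustrated cycle and on an attractive rerooting of the same model (e.g. Figure 1(a) versus (c)) illustrates the difference in behaviour discussed above.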
### 5.4. FPRAS, Bounds

Jerrum and Sinclair (1993) devised a fully polynomial-time randomized approximation scheme (FPRAS) for the partition function of a model M provided it is attractive and all singleton potentials are consistent in taking the same sign (positive or negative). This generalizes to any model whose uprooted model M+ is balanced (see §2.1).

Various methods have been developed to bound the partition function or marginals of a model (Leisink and Kappen, 2003; Ihler, 2007; Mooij and Kappen, 2008). These treat singleton and edge potentials differently, hence may be generalized by considering rerootings.

### 5.5. Remarks

**Comparison to clamping.** Some of the benefits of rerooting could also be obtained by the usual clamping of M. For example, if a model can be rendered balanced by rerooting at Xi, then this could also be achieved by clamping Xi in the original model. However, this would require performing multiple inference runs and combining results, rather than using the free clamping available with a rerooting, see §4.2. Further, several results, including Theorem 4 and the observations in §5.2 on the triplet polytope, are not possible without considering rerooting.

**Evaluation of inference methods.** Approximate inference methods are typically evaluated empirically on a range of models, where singleton and edge potentials are treated quite differently. Often singleton potentials are drawn from some fixed narrow range while edge potentials are drawn from a range whose width is varied widely. From an uprooted model perspective, singleton and edge potentials are equivalent. Hence: (i) varying singleton and edge potentials differently in empirical evaluations may be a peculiar assumption, though it could be justified as reflecting typical patterns in the real world; and (ii) the implicit choice of root may be poor in some cases (i.e. results might be improved significantly by rerooting), which will obscure the underlying performance attributes of the inference method. We examine the extent of this effect in §6.

## 6. Experiments

Following the observation in §5.5, we are interested in the effect of rerooting in standard settings for empirical evaluation. We compared the performance of estimating the partition function and singleton marginals after different rerootings for three popular approximation methods: Bethe (using the approach of Heskes et al., 2003 to ensure convergence), tree-reweighted (TRW; Wainwright et al., 2005) and naive mean field. For true values, we used the junction tree algorithm. All methods were implemented using libDAI (Mooij, 2010); see Appendix §9 for details.

We ran experiments on the following topologies and model sizes: complete graphs on 10 and 15 variables; grids of size 5×5 and 9×9. All potentials were drawn randomly: mixed models used Wij ∼ U[−Wmax, Wmax] and attractive models used Wij ∼ U[0, Wmax], as Wmax was varied; singleton potentials were drawn either from a low range θi ∼ U[−0.1, 0.1], a medium range θi ∼ U[−2, 2], or from a range commensurate with the edge potentials, i.e. θi ∼ U[−Wmax/2, Wmax/2], with the factor of 2 needed given the form of (1). These settings allow direct comparison to earlier work such as by Weller and Domke (2016) or Weller et al. (2014). Others (Meshi et al., 2009; Sontag and Jaakkola, 2007) use binary variables with values in {−1, 1} instead of {0, 1}, hence their edge (singleton) potentials should be multiplied by 4 (respectively 2) when making comparisons. We plot average error over 100 random runs for each setting. All results are in the Appendix.
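For reference, a sketch (ours; `random_model` and its argument names are our own) of this test-bed for complete graphs; grids would differ only in the edge set:

```python
import random

def random_model(n, wmax, mixed=True, singleton="low", rng=random):
    """Draw a random binary pairwise model in the form of eq. (1):
    edges from U[-wmax, wmax] (mixed) or U[0, wmax] (attractive); singletons
    from a low, medium, or edge-commensurate ('scaled') range."""
    lo = -wmax if mixed else 0.0
    half_width = {"low": 0.1, "medium": 2.0, "scaled": wmax / 2.0}[singleton]
    theta = {i: rng.uniform(-half_width, half_width) for i in range(1, n + 1)}
    edges = {(i, j): rng.uniform(lo, wmax)
             for i in range(1, n + 1) for j in range(i + 1, n + 1)}
    return theta, edges
```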
As in §4.2, any rooting of a model M may be considered a clamping of the uprooted model M+. The original model M = M0 implicitly reflects the decision to clamp at X0, which might be a good or bad choice depending on the setting. Recall from §4.2 that maxW often performs well for selecting a variable to clamp, picking one with the highest sum of incident edge strengths (taking absolute values). However, if a choice must be made between variable A with many weak edges and variable B with few strong edges, maxW may make a poor choice by not recognizing that A is often better, since the influence of strong edges saturates. Hence we introduced the maxtW heuristic in §4.2.1, which selects the variable i with maximum Σ_{j∈N(i)} tanh|Wij/4|.

Our plots show average error when applying the approximate inference method to: the original model M; the uprooted model M+; and rerootings at the worst variable, the best variable, the maxW variable, and the maxtW variable. Best and worst always refer to the variable at which rerooting gave, with hindsight, the best and worst error for the partition function (even in plots for marginals).

### 6.1. Results

Figure 3 summarizes results for Bethe, typically the most accurate method.

[Figure 3 panels omitted: error of log Z (and l1 error of marginals for the 15-variable complete graph) plotted against maximum edge strength Wmax ∈ {2, 4, 8, 12, 16}, for the original M, M+, and rerootings at the worst, maxW, maxtW and best variables. The panel titles note that on complete graphs maxW and maxtW both perform well, while on toroidal grids maxtW performs much better than maxW.]

Figure 3. Average error of the Bethe approximation for models with mixed potentials over 100 runs, showing smaller and larger models for comparison. Top: complete graphs (10 and 15 variables). Bottom: toroidal grid graphs (5×5 and 9×9). Each column shows a different setting for the singleton potentials: left, low range θi ∈ U[−0.1, 0.1]; centre, medium range θi ∈ U[−2, 2]; right, singleton and edge potentials scaled together. See §6.

Looking across all results (see Appendix §9), we make the following observations.

For complete graphs, maxW and maxtW perform well. Rerooting is very effective as edge strength grows, both at low and medium levels of singleton potentials. This makes sense since, in this setting, the default rooting at X0 has relatively weak edges, and all variables in M+ have the same number of edges, hence it is likely to be very beneficial to switch to a different root with stronger edges. When singleton and edge potentials vary together, edges in M+ are all similar, but X0 is an average variable to clamp, whereas we do somewhat better by choosing a good variable.

For grids, maxtW is much better than maxW (maxW performs very poorly in some cases), appearing to handle the uneven edge weights in M+ well. At low singleton potentials, rerooting is very helpful, but this benefit disappears for stronger singleton potentials, where the original rooting performs equally to maxtW.

Results for MF and TRW are qualitatively similar to Bethe, with Bethe typically performing best. For mixed models with strong edges, MF performs very well. This is likely due to MF optimizing within the marginal polytope, whereas Bethe and TRW use the local polytope, in which strong frustrated cycles can lead to high error.

Based on maxtW, we can suggest a guideline for when rerooting is likely to be helpful. For example, for a 4-connected grid with n variables, constant singleton potentials T and edge weights W, reroot when

$$4 \tanh\frac{W}{4} + \tanh\frac{2T}{4} > n \tanh\frac{2T}{4}, \quad \text{i.e.} \quad 4 \tanh\frac{W}{4} > (n-1) \tanh\frac{T}{2}.$$

This is conservative, since more randomness increases the value of rerooting by raising the chance of a better root. Demonstrating this, observe in Figure 3 that when singleton potentials are low, the improvement in log Z estimate from rerooting using maxtW is about the same for 9×9 grids as for smaller 5×5 grids.
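The rule of thumb above simply compares the maxtW score of an interior grid variable (four edges of weight W plus one edge of weight 2T to X0) against the score of X0 itself (n edges of weight 2T). As a one-line check (ours; the function name is our own):

```python
import math

def rerooting_suggested(n, W, T):
    """Guideline of Section 6.1 for a 4-connected grid on n variables with
    constant singleton strength T and edge weight W: reroot when an interior
    variable outscores X0 under maxtW."""
    return 4 * math.tanh(W / 4.0) > (n - 1) * math.tanh(T / 2.0)

# e.g. rerooting_suggested(25, 8.0, 0.1) -> True for a 5x5 grid with strong
# edges and weak singletons; large T quickly makes it False.
```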
## 7. Conclusion

We introduced the idea of uprooting and then rerooting any binary pairwise graphical model. This immediately leads to a meta-algorithm for inference into which any existing approach may be slotted, and generalizes important theoretical results. Further, it provides an elegant conceptual framework for rethinking singleton and edge potentials, with methodological consequences for how we evaluate models and methods. One application in §5.2 leads to Theorem 4, a strong result for tightness of LP relaxations on the triplet-consistent polytope TRI, and a remarkable interpretation of TRI as universally rooted.

Rerooting switches an implicit clamp choice in the uprooted model at X0 (perhaps a poor choice) to a carefully selected clamp choice, almost for free. This applies even for large models where it is desirable to clamp a series of variables: by rerooting, we may obtain one of the series for free, sometimes achieving dramatic improvements in accuracy with little work required. If there are multiple connected components, each should be handled separately, with its own X0-type variable. This could be useful for (repeatedly) composing clamping and then rerooting each separated component.

Rerooting is particularly effective when a model has dense, strong edge weights and weak singleton potentials (a difficult setting for many existing methods). Our new maxtW heuristic performs particularly well in this setting (and should also be helpful for standard clamping approaches), sometimes dramatically outperforming the earlier maxW method. maxtW also provides a useful guideline for when rerooting is likely to be helpful, see the last paragraph of §6.1.

It will be interesting in future work to study further consequences of our interpretation of the triplet-consistent polytope, to consider the value of rerooting for approaches to learning graphical models, and to explore the benefits of rerooting when variables have a higher number of labels.

**Acknowledgments.** The author thanks Justin Domke, Tony Jebara, Mark Rowland, Nilesh Tripuraneni, David Sontag and the anonymous reviewers for helpful comments.

## References

F. Barahona and A. Mahjoub. On the cut polytope. Mathematical Programming, 36(2):157–173, 1986.

F. Barahona, M. Grötschel, M. Jünger, and G. Reinelt. An application of combinatorial optimization to statistical physics and circuit layout design. Operations Research, 36(3):493–513, 1988.

A. Blake, P. Kohli, and C. Rother, editors. Markov Random Fields for Vision and Image Processing. MIT Press, 2011.

M. Deza and M. Laurent. Geometry of Cuts and Metrics. Springer, 1st edition, 1997. ISBN 978-3-642-04294-2.

F. Eaton and Z. Ghahramani. Choosing a variable to clamp: Approximate inference using conditioned belief propagation. In Artificial Intelligence and Statistics (AISTATS), 2009.

F. Eaton and Z. Ghahramani. Model reductions for inference: Generality of pairwise, binary, and planar factor graphs. Neural Computation, 25(5):1213–1260, 2013.

D. Greig, B. Porteous, and A. Seheult. Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistical Society, Series B, 51(2):271–279, 1989.
F. Harary. On the notion of balance of a signed graph. Michigan Mathematical Journal, 2:143–146, 1953.

F. Harary and J. Kabell. A simple algorithm to detect balance in signed graphs. Mathematical Social Sciences, 1(1):131–136, 1980.

T. Heskes. On the uniqueness of loopy belief propagation fixed points. Neural Computation, 16(11):2379–2413, 2004.

T. Heskes, K. Albers, and B. Kappen. Approximate inference and constrained optimization. In Uncertainty in Artificial Intelligence (UAI), pages 313–320, 2003.

A. Ihler. Accuracy bounds for belief propagation. In Uncertainty in Artificial Intelligence (UAI), 2007.

T. Jebara. MAP estimation, message passing, and perfect graphs. In Uncertainty in Artificial Intelligence (UAI), 2009.

M. Jerrum and A. Sinclair. Polynomial-time approximation algorithms for the Ising model. SIAM Journal on Computing, 22(5):1087–1116, 1993.

M. Leisink and H. Kappen. Bound propagation. Journal of Artificial Intelligence Research (JAIR), 19:139–154, 2003.

O. Meshi, A. Jaimovich, A. Globerson, and N. Friedman. Convexifying the Bethe free energy. In Uncertainty in Artificial Intelligence (UAI), pages 402–410, 2009.

J. Mooij. libDAI: A free and open source C++ library for discrete approximate inference in graphical models. Journal of Machine Learning Research, 11:2169–2173, August 2010. URL http://www.jmlr.org/papers/volume11/mooij10a/mooij10a.pdf.

J. Mooij and H. Kappen. On the properties of the Bethe approximation and loopy belief propagation on binary networks. Journal of Statistical Mechanics: Theory and Experiment, 2005.

J. Mooij and H. Kappen. Sufficient conditions for convergence of the sum-product algorithm. IEEE Transactions on Information Theory, 53(12):4422–4437, December 2007.

J. M. Mooij and H. J. Kappen. Bounds on marginal probability distributions. In Neural Information Processing Systems (NIPS), pages 1105–1112, 2008.

P. Pakzad and V. Anantharam. Belief propagation and statistical physics. Princeton University, 2002.

J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

D. Sontag. Cutting plane algorithms for variational inference in graphical models. Master's thesis, MIT, EECS, 2007.

D. Sontag and T. Jaakkola. New outer bounds on the marginal polytope. In Neural Information Processing Systems (NIPS), 2007.

E. Sudderth, M. Wainwright, and A. Willsky. Loop series and Bethe variational bounds in attractive graphical models. In Neural Information Processing Systems (NIPS), 2007.

M. Wainwright and M. Jordan. Graphical models, exponential families and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.

M. Wainwright, T. Jaakkola, and A. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7):2313–2335, 2005.

Y. Watanabe. Uniqueness of belief propagation on signed graphs. In Neural Information Processing Systems (NIPS), 2011.

A. Weller. Bethe and related pairwise entropy approximations. In Uncertainty in Artificial Intelligence (UAI), 2015a.

A. Weller. Revisiting the limits of MAP inference by MWSS on perfect graphs. In Artificial Intelligence and Statistics (AISTATS), 2015b.

A. Weller. Characterizing tightness of LP relaxations by forbidding signed minors. In Uncertainty in Artificial Intelligence (UAI), 2016.

A. Weller and J. Domke. Clamping improves TRW and mean field approximations. In Artificial Intelligence and Statistics (AISTATS), 2016.

A. Weller and T. Jebara. Clamping variables and approximate inference. In Neural Information Processing Systems (NIPS), 2014.

A. Weller, K. Tang, D. Sontag, and T. Jebara. Understanding the Bethe approximation: When and how can it go wrong? In Uncertainty in Artificial Intelligence (UAI), 2014.
A. Weller, M. Rowland, and D. Sontag. Tightness of LP relaxations for almost balanced models. In Artificial Intelligence and Statistics (AISTATS), 2016.

M. Welling and Y. Teh. Belief optimization for binary networks: A stable alternative to loopy belief propagation. In Uncertainty in Artificial Intelligence (UAI), 2001.

J. Yedidia, W. Freeman, and Y. Weiss. Generalized belief propagation. In Neural Information Processing Systems (NIPS), 2000.