# uprooting_and_rerooting_higherorder_graphical_models__defb49f1.pdf

Uprooting and Rerooting Higher-Order Graphical Models

Mark Rowland University of Cambridge mr504@cam.ac.uk

Adrian Weller University of Cambridge and Alan Turing Institute aw665@cam.ac.uk

The idea of uprooting and rerooting graphical models was introduced speciﬁcally for binary pairwise models by Weller [19] as a way to transform a model to any of a whole equivalence class of related models, such that inference on any one model yields inference results for all others. This is very helpful since inference, or relevant bounds, may be much easier to obtain or more accurate for some model in the class. Here we introduce methods to extend the approach to models with higher-order potentials and develop theoretical insights. In particular, we show that the triplet-consistent polytope TRI is unique in being universally rooted . We demonstrate empirically that rerooting can signiﬁcantly improve accuracy of methods of inference for higher-order models at negligible computational cost.

1 Introduction

Undirected graphical models with discrete variables are a central tool in machine learning. In this paper, we focus on three canonical tasks of inference: identifying a conﬁguration with highest probability (termed maximum a posteriori or MAP inference), computing marginal probabilities of subsets of variables (marginal inference) and calculating the normalizing constant (partition function). All three tasks are typically computationally intractable, leading to much work to identify settings where exact polynomial-time methods apply, or to develop approximate algorithms that perform well.

Weller [19] introduced an elegant method which ﬁrst uproots and then reroots a given model M to any of a whole class of rerooted models {Mi}. The method relies on speciﬁc properties of binary pairwise models and makes use of an earlier construction which reduced MAP inference to the MAXCUT problem on the suspension graph G (1; 2; 12; 19, see 3 for details). For many important inference tasks, the rerooted models are equivalent in the sense that results for any one model yield results for all others with negligible computational cost. This can be very helpful since various models in the class may present very different computational difﬁculties for inference.

Here we show how the idea may be generalized to apply to models with higher-order potentials over any number of variables. Such models have many important applications, for example in computer vision [6] or modeling protein interactions [5]. As for pairwise models, we again obtain signiﬁcant beneﬁts for inference. We also develop a deeper theoretical understanding and derive important new results. We highlight the following contributions: In 34, we show how to achieve efﬁcient uprooting and rerooting of binary graphical models with potentials of any order, while still allowing easy recovery of inference results. In 5, to simplify the subsequent analysis, we introduce pure k-potentials for any order k, which may be of independent interest. We show that there is essentially only one pure k-potential which we call the even k-potential, and that even k-potentials form a basis for all model potentials. In 6, we carefully analyze the effect of uprooting and rerooting on Sherali-Adams [11] relaxations Lr of the marginal polytope, for any order r. One surprising observation in 6.2 is that L3 (the

Authors contributed equally.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

triplet-consistent polytope or TRI) is unique in being universally rooted, in the sense that there is an afﬁne score-preserving bijection between L3 for a model and L3 for each of its rerootings. In 7, our empirical results demonstrate that rerooting can signiﬁcantly improve accuracy of inference in higher-order models. We introduce effective heuristics to choose a helpful rerooting.

Our observations have further implications for the many variational methods of marginal inference which optimize the sum of score and an entropy approximation over a Sherali-Adams polytope relaxation. These include the Bethe approximation (intimately related to belief propagation) and cluster extensions, tree-reweighted (TRW) approaches and logdet methods [12; 14; 16; 22; 24].

1.1 Background and discussion of theoretical contributions

Based on earlier connections in [2], [19] showed the remarkable result for pairwise models that the triplet-consistent polytope (L3 or TRI) is universally rooted (in the restricted sense deﬁned in [19, Theorem 3]). This observation allowed straightforward strengthening of previously known results, for example: it was previously shown [23] that the LP relaxation on TRI (LP+TRI) is always tight for an almost-balanced binary pairwise model, that is a model which can be rendered balanced by removing one variable [17]. Given [19, Theorem 3], this earlier result could immediately be signiﬁcantly strengthened to [19, Theorem 4], which showed that LP+TRI is tight for a binary pairwise model provided only that some rerooting exists such that the rerooted model is almost balanced.

Following [19], it was natural to suspect that the universal rootedness property might hold for all (or at least some) Lr, r 3. This would have impact on work such as [10] which examines which signed minors must be forbidden to guarantee tightness of LP+L4. If L4 were universally rooted, then it would be possible to simplify signiﬁcantly the analysis in [10].

Considering this issue led to our analysis of the mappings to symmetrized uprooted polytopes given in our Theorem 17. We believe this is the natural generalization of the lower order relationships of L2 and L3 to RMET and MET described in [2], though this direction was not clear initially.

With this formalism, together with the use of even potentials, we demonstrate our Theorems 20 and 21, showing that in fact TRI is unique in being universally rooted (and indeed in a stronger sense than given in [19]). We suggest that this result is surprising and may have further implications.

As a consequence, it is not possible to generate some quick theoretical wins by generalizing previous results as [19] did to derive their Theorem 4, but on the other hand we observe that rerooting may be helpful in practice for any approach using a Sherali-Adams relaxation other than L3. We verify the potential for signiﬁcant beneﬁts experimentally in 7.

2 Graphical models

A discrete graphical model M[G(V, E), (θE)E E] consists of: a hypergraph G = (V, E), which has n vertices V = {1, . . . , n} corresponding to the variables of the model, and hyperedges E P(V ), where P(V ) is the powerset of V ; together with potential functions (θE)E E over the hyperedges E E. We consider binary random variables (Xv)v V with each Xv Xv = {0, 1}. For a subset U V , x U {0, 1}U is a conﬁguration of those variables (Xv)v U. We write x U for the ﬂipping of x U, deﬁned by xi = 1 xi i U. The joint probability mass function factors as follows, where the normalizing constant Z = P

x V {0,1}V exp(score(x V )) is the partition function:

p(x V ) = 1

Z exp (score(x V )) , score(x V ) = X

E E θE(x E). (1)

3 Uprooting and rerooting

Our goal is to map a model M to any of a whole family of models {Mi} in such a way that inference on any Mi will allow us easily to recover inference results on the original model M. In this section we provide our mapping, then in 4 we explain how to recover inference results for M.

The uprooting mechanism used by Weller [19] ﬁrst reparametrizes edge potentials to the form θij(xi, xj) = 1

2Wij1[xi = xj], where 1[ ] is the indicator function (a reparameterization modiﬁes

M4 = M +|X4=0

M2 = M +|X2=0 Figure 1: Left: The hypergraph G of a graphical model M over 4 variables, with potentials on the hyperedges {1, 2}, {1, 3, 4}, and {2, 4}. Center-left: The suspension hypergraph G of the uprooted model M +. Centerright: The hypergraph G\{4} of the rerooted model M4 = M +|X4=0, i.e. M + with X4 clamped to 0. Right: The hypergraph G \ {2} of the rerooted model M2 = M +|X2=0, i.e. M + with X2 clamped to 0.

potential functions such that the complete score of each conﬁguration is unchanged, see 15 for details). Next, singleton potentials are converted to edge potentials with this same form by connecting to an added variable X0. This mechanism had been used previously to reduce MAP inference on M to MAXCUT on the converted model [1; 12], and applies speciﬁcally only to binary pairwise models.

We introduce a generalized construction which applies to models with potentials of any order. We ﬁrst uproot a model M to a highly symmetric uprooted model M + where an extra variable X0 is added, in such a way that the original model M is exactly M + with X0 clamped to the value 0. Since X0 is clamped to retrieve M, we may write M = M0 := M +|X0=0. Alternatively, we can choose instead to clamp a different variable Xi in M + which will lead to the rerooted model Mi := M +|Xi=0. Deﬁnition 1 (Clamping). For a graphical model M[G = (V, E), (θE)E E], and i V , the model M|Xi=a obtained by clamping the variable Xi to the value a Xi is given by: the hypergraph (V \ {i}, Ei), where Ei = {E \ {i}|E E}; and potentials which are unchanged for hyperedges which do not contain i, while if i E then θE\{i}(x E\{i}) = θE(x E\{i}, xi = a).

Deﬁnition 2 (Uprooting, suspension hypergraph). Given a model M[G(V, E), (θE)E E], the uprooted model M + adds a variable X0, which is added to every hyperedge of the original model. M + has hypergraph G, with vertex set V + = V {0} and hyperedge set E+ = {E+ = E {0}|E E}. G is the suspension hypergraph of G. M + has potential functions (θ+ E {0})E E given by

θ+ E {0}(x E {0}) = θE(x E) if x0 = 0 θE(x E) if x0 = 1.

With this deﬁnition, all uprooted potentials are symmetric in that θ+ E+(x E+) = θ+ E+(x E+) E+ E+.

Deﬁnition 3 (Rerooting). From Deﬁnition 2, we see that given a model M, if we uproot to M + then clamp X0 = 0, we recover the original model M. If instead in M + we clamp Xi = 0 for any i = 1, . . . , n, then we obtain the rerooted model Mi := M +|Xi=0.

See Figure 1 and Table 1 for examples of uprooting and rerooting. We explore the question of how to choose a good variable for rerooting (i.e. how to choose a good variable to clamp in M +) in 7.

4 Recovery of inference tasks

Here we demonstrate that the partition function, MAP score and conﬁguration, and marginal distributions for a model M, can all be recovered from its uprooted model M + or any rerooted model Mi i V , with negligible computational cost. We write Vi = {0, 1, . . . , n} \ {i} for the variable set of rerooted model Mi; scorei(x Vi) for the score of x Vi in Mi; and pi for the probability distribution for Mi. We use superscript + to indicate the uprooted model. For example, the probability distribution for M + is given by p+(x V +) = 1 Z+ exp P E E+ θE(x E) . From the deﬁnitions of 3, we obtain the following key lemma, which is critical to enable recovery of inference results. Lemma 4 (Score-preserving map). Each conﬁguration x V of M maps to 2 conﬁgurations of the uprooted M + with the same score, i.e. from M, x V in M +, both of (x0 = 0, x V ) and (x0 = 1, x V ) with score(x V ) = score+(x0 = 0, x V ) = score+(x0 = 1, x V ). For any i V +, exactly one of the two uprooted conﬁgurations has xi = 0, and just this one will be selected in Mi. Hence, there is a score-preserving bijection between conﬁgurations of M and those of Mi:

For any i V + : in M, x V in Mi, (x0 = 0, x V \{i}) if xi = 0 (x0 = 1, x V \{i}) if xi = 1. (2)

M conﬁg M + conﬁguration M4 conﬁg x1 x3 x4 x0 x1 x3 x4 x0 x1 x3 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 1 0 1 1 0 0 1 1 1 0 0 0 1 0 0 0 1 0 1 0 1 0 1 0 1 1 1 0 0 1 1 0 0 1 1 1 1 1 0 1 1 1 1 0 0 0 1 0 0 1 0 0 1 1 0 1 0 1 0 1 1 0 1 1 1 1 0 0 1 1 0 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1

Table 1: An illustration of how scores of potential θ134 on hyperedge {1, 3, 4} in an original model M map to potential θ0134 in M + and then to θ013 in M4. See Figure 1 for the hypergraphs. Each color indicates a value of θ134(x1, x3, x4) for a different conﬁguration (x1, x3, x4). Note that M + has 2 rows of each color, while after rerooting to M4, we again have exactly one row of each color. The 1-1 score preserving map between conﬁgurations of M and any Mi is critical to enable recovery of inference results; see Lemma 4.

Table 1 illustrates this perhaps surprising result, from which the next two propositions follow.

Proposition 5 (Recovering the partition function). Given a model M[G(V, E), (θE)E E] with partition function Z as in (1), the partition function Z+ of the uprooted model M + is twice Z, and the partition function of each rerooted model Mi is exactly Z, for any i V .

Proposition 6 (Recovering a MAP conﬁguration). From M +: x V is an arg max for p iff (x0 = 0, x V ) is an arg max for p+ iff (x0 = 1, x V ) is an arg max for p+. From a rerooted model Mi: (x V \{i}, xi = 0) is an arg max for p iff (x0 = 0, x V \{i}) is an arg max for pi; (x V \{i}, xi = 1) is an arg max for p iff (x0 = 1, x V \{i}) is an arg max for pi.

We can recover marginals as shown in the following proposition, proof in the Appendix 9.1.

Proposition 7 (Recovering marginals). For a subset = U V , we can recover from M +: p(x U) = p+(x0 = 0, x U) + p+(x0 = 1, x U) = 2p+(x0 = 0, x U) = 2p+(x0 = 1, x U). To recover from a rerooted Mi: (i) For any i V \ U, p(x U) = pi(x0 = 0, x U) + pi(x0 = 1, x U).

(ii) For any i U, p(x U) = pi(x0 = 0, x U\{i}) xi = 0 pi(x0 = 1, x U\{i}) xi = 1.

In 6, we provide a careful analysis of the impact of uprooting and rerooting on the Sherali-Adams hierarchy of relaxations of the marginal polytope [11]. We ﬁrst introduce a way to parametrize potentials which will be particularly useful, and which may be of independent interest.

5 Pure k-potentials

We introduce the notion of pure k-potentials. These allow the speciﬁcation of interactions which act purely over a set of variables of a given size k, without inﬂuencing the distribution of any subsets. We show that in fact, there is essentially only one pure k-potential. Further, we show that one can express any θE potential in terms of pure potentials over E and subsets of E, and that pure potentials have appealing properties when uprooted and rerooted which help our subsequent analysis.

We say that a potential is a k-potential if k is the smallest number such that the score of the potential may be determined by considering the conﬁguration of k variables. Usually a potential θE is a k-potential with k = |E|. For example, typically a singleton potential is a 1-potential, and an edge potential is a 2-potential. However, note that k < |E| is possible if one or more variables in E are not needed to establish the score (a simple example is θ12(x1, x2) = x1, which clearly is a 1-potential).

In general, a k-potential will affect the marginal distributions of all subsets of the k variables. For example, one popular form of 2-potential is θij(xi, xj) = Wijxixj, which tends to pull Xi and Xj toward the same value, but also tends to increase each of p(Xi = 1) and p(Xj = 1). For pairwise models, a different reparameterization of potentials instead writes the score as

score(x V ) = X

i V θixi + 1

(i,j) E Wij1[xi = xj]. (3)

Expression (3) has the desirable feature that the θij(xi, xj) = 1

2Wij1[xi = xj] edge potentials affect only the pairwise marginals, without disturbing singleton marginals. This motivates the following deﬁnition. Deﬁnition 8. Let k 2, and let U be a set of size k. We say that a k-potential θU : {0, 1}U R is a pure k-potential if the distribution induced by the potential, p(x U) exp(θU(x U)), has the property that for any = W U, the marginal distribution p(x W ) is uniform.

We shall see in Proposition 10 that a pure k-potential must essentially be an even k-potential. Deﬁnition 9. Let k N, and |U| = k. An even k-potential is a k-potential θU : {0, 1}U R of the form θU(x U) = a1[ |{i U|xi = 1}| is even], for some a R which is its coefﬁcient. In words, θU(x U) takes value a if x U has an even number of 1s, else it takes value 0.

As an example, the 2-potential θij(xi, xj) = 1

2Wij1[xi = xj] in (3) is an even 2-potential with U = {i, j} and coefﬁcient Wij/2. The next two propositions are proved in the Appendix 9.2. Proposition 10 (All pure potentials are essentially even potentials). Let k 2, and |U| = k. If θU :{0, 1}U R is a pure k-potential then θU must be an afﬁne function of the even k-potential, i.e. a, b R s.t. θU(x U) = a1[ |{i U|xi = 1}| is even] + b. Proposition 11 (Even k-potentials form a basis). For a ﬁnite set U, the set of even k-potentials 1[ |{i W|Xi = 1}| is even]

W U, indexed by subsets W U, forms a basis for the vector space of all potential functions θ : {0, 1}U R.

Any constant in a potential will be absorbed into the partition function Z and does not affect the probability distribution, see (1). An even 2-potential with positive coefﬁcient, e.g. as in (3) if Wij > 0, is supermodular. Models with only supermodular potentials (equivalently, submodular cost functions) typically admit easier inference [3; 7]; if such a model is binary pairwise then it is called attractive. However, for k > 2, even k-potentials θE are neither supermodular nor submodular. Yet if k is an even number, observe that θE(x E) = θE(x E). We discuss this further in Appendix 10.4.

When a k-potential is uprooted, in general it may become a (k + 1)-potential (recall Deﬁnition 2). The following property of even k-potentials is helpful for our analysis in 6, and is easily checked. Lemma 12 (Uprooting an even k-potential). When an even k-potential θE with |E| = k is uprooted: if k is an even number, then the uprooted potential is exactly the same even k-potential; if k is odd, then we obtain the even (k + 1)-potential over E {0} with the same coefﬁcient as the original θE.

6 Marginal polytope and Sherali-Adams relaxations

We saw in Lemma 4 that there is a score-preserving 1-2 mapping from conﬁgurations of M to those of M +, and a bijection between conﬁgurations of M and any Mi. Here we examine the extent to which these score-preserving mappings extend to (pseudo-)marginal probability distributions over variables by considering the Sherali-Adams relaxations [11] of the respective marginal polytopes. These relaxations feature prominently in many approaches for MAP and marginal inference.

For U V , we write µU for a probability distribution in P({0, 1}U), the set of all probability distributions on {0, 1}U. Bold µ will represent a collection of measures over various subsets of variables. Given (1), to compute an expected score, we need (µE)E E. This motivates the following.

Deﬁnition 13. The marginal polytope M(G(V, E)) = {(µE)E E µV s.t. µV E = µE E E}, where for U1 U2 V , µU2 U1 denotes the marginalization of µU2 P({0, 1}U2) onto {0, 1}U1.

M(G) consists of marginal distributions for every hyperedge E E such that all the marginals are consistent with a global distribution over all variables V . Methods of variational inference typically

optimize either the score (for MAP inference) or the score plus an entropy term (for marginal inference) over a relaxation of the marginal polytope [15]. This is because M(G) is computationally intractable, with an exponential number of facets [2]. Relaxations from the Sherali-Adams hierarchy [11] are often used, requiring consistency only over smaller clusters of variables. Deﬁnition 14. Given an integer r 2, if a hypergraph G(V, E) satisﬁes max E E |E| r |V |, then we say that G is r-admissible, and deﬁne the Sherali-Adams polytope of order r on G by

Lr(G) = (µE)E E

(µU) U V |U|=r locally consistent, s.t. µU E = µE E U V, |U| = r ,

where a collection of measures (µA)A I (for some set I of subsets of V ) is locally consistent, or l.c., if for any A1, A2 I, we have µA1 A1 A2 = µA2 A1 A2. Each element of Lr(G) is a set of locally consistent probability measures over the hyperedges. Note that M(G) Lr(G) Lr 1(G). The pairwise relaxation L2(G) is commonly used but higher-order relaxations achieve greater accuracy, have received signiﬁcant attention [10; 13; 18; 22; 23], and are required for higher-order potentials.

6.1 The impact of uprooting and rerooting on Sherali-Adams polytopes

We introduce two variants of the Sherali-Adams polytopes which will be helpful in analyzing uprooted models. For a measure µU P({0, 1}U), we deﬁne the ﬂipped measure µU as µU(x U) = µU(x U) x U {0, 1}U. A measure µU is ﬂipping-invariant if µU = µU. Deﬁnition 15. The symmetrized Sherali-Adams polytopes for an uprooted hypergraph G(V +, E+) (as given in Deﬁnition 2), is:

e Lr( G) = (µE)E E+ Lr( G) µE = µE E E+ .

Deﬁnition 16. For any i V +, and any integer r 2 such that max E E+ |E| r |V +|, we deﬁne the symmetrized Sherali-Adams polytope of order r uprooted at i to be

e Li r( G) = (µE)E E+

(µU)i U V +

|U|=r l.c., s.t. µU E = µE E U V, |U| = r, i U µU = µU U V, |U| = r, i U

Thus, for each collection of measures over hyperedges in e Li r( G), there exist corresponding ﬂippinginvariant, locally consistent measures on sets of size r which contain i (and their subsets). Note that for any hypergraph G(V, E) and any i V +, we have e Lr+1( G) e Li r+1( G) e Lr( G). We next extend the correspondence of Lemma 4 to collections of locally-consistent probability distributions on the hyperedges of G, see the Appendix 9.3 for proof. Theorem 17. For a hypergraph G(V, E), and integer r such that max E E |E| r |V |, there is an afﬁne score-preserving bijection

Lr(G) Uproot Root At0 e L0 r+1( G) .

Theorem 17 establishes the following diagram of polytope inclusions and afﬁne bijections:

For M = M0 : Lr+1(G) Unnamed Lr(G)

Uproot y x Root At0 Uproot y x Root At0 Uproot y x Root At0

For M + : e L0 r+2( G) e Lr+1( G) e L0 r+1( G) .

A question of theoretical interest and practical importance is which of the inclusions in (4) are strict. Our perspective here generalizes earlier work. Using different language, Deza and Laurent [2] identiﬁed L2(G) with e L0 3( G), which was termed RMET, the rooted semimetric polytope; and e L3( G) with MET, the semimetric polytope. Building on this, Weller [19] considered L3(G), the triplet-consistent polytope or TRI, though only in the context of pairwise potentials, and showed that L3(G) has the remarkable property that if it is used to optimize an LP for a model M on G, the exact same optimum is achieved for L3(Gi) for any rerooting Mi. It was natural to conjecture that Lr(G) might have this same property for all r > 3, yet this was left as an open question.

6.2 L3 is unique in being universally rooted

We shall ﬁrst strengthen [19] to show that L3 is universally rooted in the following stronger sense.

Deﬁnition 18. We say that the rth-order Sherali-Adams relaxation is universally rooted (and write Lr is universally rooted for short) if for all admissible hypergraphs G, there is an afﬁne scorepreserving bijection between Lr(G) and Lr(Gi), for each rerooted hypergraph (Gi)i V .

If Lr is universally rooted, this applies for potentials over up to r variables (the maximum which makes sense in this context), and clearly it implies that optimizing score over any rerooting (as in MAP inference) will attain the same objective. The following result is proved in the Appendix 9.3.

Lemma 19. If Lr is universally rooted for hypergraphs of maximum hyperedge degree p < r with p even, then Lr is also universally rooted for r-admissible hypergraphs with maximum degree p + 1.

The proof relies on mapping to the symmetrized uprooted polytope e L0 r+1( G). Then by considering marginals using a basis equivalent to that described in Proposition 11 for even k-potentials, we observe that the symmetry of the polytope enforces only one possible marginal for (p + 1)-clusters.

Combining Lemma 19 with arguments which extend those used by [19] demonstrates the following result, proved in the Appendix.

Theorem 20. L3 is universally rooted.

We next provide a striking and rather surprising result, see the Appendix for proof and details.

Theorem 21. L3 is unique in being universally rooted. Speciﬁcally, for any integer r > 1 other than r = 3, we constructively demonstrate a hypergraph G(V, E) with |V | = r + 1 variables for which e L0 r+1( G) = e Li r+1( G) for any i V .

Theorem 21 examines e L0 r+1( G) and e Li r+1( G), which by Theorem 17 are the uprooted equivalents of Lr(G) and Lr(Gi). It might appear more satisfying to try to demonstrate the result directly for the rooted polytopes, i.e. to show Lr(G) = Lr(Gi). However, in general the rooted polytopes are not comparable: an r-potential in M can map to an (r + 1)-potential in M + and then to an (r + 1)-potential in Mi which cannot be evaluated for an Lr polytope.

Theorem 21 shows that we may hope for beneﬁts from rerooting for any inference method based on a Sherali-Adams relaxed polytope Lr, unless r = 3.

7 Experiments

Here we show empirically the beneﬁts of uprooting and rerooting for approximate inference methods in models with higher-order potentials. We introduce an efﬁcient heuristic which can be used in practice to select a variable for rerooting, and demonstrate its effectiveness.

We compared performance after different rerootings of marginal inference (to guarantee convergence we used the double loop method of Heskes et al. [4], which relates to generalized belief propagation, 24) and MAP inference (using loopy belief propagation, LBP [9]). For true values, we used the junction tree algorithm. All methods were implemented using lib DAI [8]. We ran experiments on complete hypergraphs (with 8 variables) and toroidal grid models (5 5 variables). Potentials up to order 4 were selected randomly, by drawing even k-potentials from Unif([ Wmax, Wmax]) distributions for a variety of Wmax parameters, as shown in Figure 2, which highlights results for estimating log Z. For each regime of maximum potential values, we plot results averaged over 20 runs. For additional details and results, including marginals, other potential choices and larger models, see Appendix 10.

We display average error of the inference method applied to: the original model M; the uprooted model M +; then rerootings at: the worst variable, the best variable, the K heuristic variable, and the G heuristic variable. Best and worst always refer to the variable at which rerooting gave with hindsight the best and worst error for the partition function (even in plots for other measures).

7.1 Heuristics to pick a good variable for rerooting

From our Deﬁnition 3, a rerooted model Mi is obtained by clamping the uprooted model M + at variable Xi. Hence, selecting a good variable for rerooting is exactly the choice of a good variable to clamp in M +. Considering pairwise models, Weller [19] reﬁned the max W method [20; 21] to introduce the maxt W heuristic, and showed that it was very effective empirically. maxt W selects the variable Xi with max P

j N(i) tanh | Wij

4 |, where N(i) is the set of neighbors of i in the model graph, and Wij is the strength of the pairwise interaction.

The intuition for maxt W is as follows. Pairwise methods of approximate inference such as Bethe are exact for models with no cycles. If we could, we would like to break tight cycles with strong edge weights, since these lead to error. When a variable is clamped, it is effectively removed from the model. Hence, we would like to reroot at a variable that sits on many cycles with strong edge weights. Identifying such cycles is NP-hard, but the maxt W heuristic attempts to do this by looking only locally around each variable. Further, the effect of a strong edge weight saturates [21]: a very strong edge weight Wij effectively locks its end variables (either together or opposite depending on the sign of Wij), and this effect cannot be signiﬁcantly increased even by an extremely strong edge. Hence the tanh function was introduced to the earlier max W method, leading to the maxt W heuristic.

As observed in 5, if we express our model potentials in terms of pure k-potentials, then the uprooted model will only have pure k-potentials for various values of k which are even numbers. Intuitively, the higher the coefﬁcients on these potentials, the more tightly connected is the model leading to more challenging inference. Hence, a natural way to generalize the maxt W approach to handle higher-order potentials is to pick a variable Xi in M + which maximizes the following measure:

clamp-heuristic-measure(i) = X

i E:|E|=2 c2 tanh |t2a E| + X

i E:|E|=4 c4 tanh |t4a E|, (5)

where a E is the coefﬁcient (weight) of the relevant pure k-potential, see Deﬁnition 9, and the {c2, t2}, {c4, t4} terms are constants for pure 2-potentials and for pure 4-potentials respectively. This approach extends to potentials of higher orders by adding similar further terms. Since our goal is to rank the measures for each i V +, without loss of generality we take c2 = 1. We ﬁt the t2, c4 and t4 constants to the data from our experimental runs, see the Appendix for details. Our K heuristic was ﬁt only to runs for complete hypergraphs while the G heuristic was ﬁt only to runs for models on grids.

7.2 Observations on results

Considering all results across models and approximate methods for estimating log Z, marginals and MAP inference (see Figure 2 and Appendix 10.3), we make the following observations. Both K and G heuristics perform well (in and out of sample): they never hurt materially and often signiﬁcantly improve accuracy, attaining results close to the best possible rerooting. Since our two heuristics achieve similar performance, sensitivity to the exact constants in (5) appears low. We veriﬁed this by comparing to maxt W for pairwise models as in [19]: both K and G heuristics performed just slightly better than maxt W. For all our runs, inference on rerooted models took similar time as on the original model (time required to reroot and later to map back inference results is negligible), see 10.3.1.

Observe that stronger 1-potentials tend to make inference easier, pulling each variable toward a speciﬁc setting, and reducing the beneﬁts from rerooting (left column of Figure 2). Stronger pure k-potentials for k > 1 intertwine variables more tightly: this typically makes inference harder and increases the gains in accuracy from rerooting. The pure k-potential perspective facilitates this analysis.

When we examine larger models, or models with still higher order potentials, we observe qualitatively similar results, see Appendix 10.3.4 and 10.3.6.

8 Conclusion

We introduced methods which broaden the application of the uprooting and rerooting approach to binary models with higher-order potentials of any order. We demonstrated several important theoretical insights, including Theorems 20 and 21 which show that L3 is unique in being universally rooted. We developed the helpful tool of even k-potentials in 5, which may be of independent

Average abs(error) in log Z for K8 complete hypergraphs (fully connected) on 8 variables.

Average abs(error) in log Z for Grids on 5 5 variables (toroidal). Legends are consistent across all plots.

vary Wmax for 1-pots vary Wmax for 2-pots vary Wmax for 3-pots vary Wmax for 4-pots

Figure 2: Error in estimating log Z for random models with various pure k-potentials over 20 runs. If not shown, Wmax max coefﬁcients for pure k-potentials are 0 for k = 1, 8 for k = 2, 0 for k = 3, 8 for k = 4. Where the red K heuristic curve is not visible, it coincides with the green G heuristic. Both K and G heuristics for selecting a rerooting work well: they never hurt and often yield large beneﬁts. See 7 for details.

interest. We empirically demonstrated signiﬁcant beneﬁts for rerooting in higher-order models particularly for the hard case of strong cluster potentials and weak 1-potentials and provided an efﬁcient heuristic to select a variable for rerooting. This heuristic is also useful to indicate when rerooting is unlikely to be helpful for a given model (if (5) is maximized by taking i = 0).

It is natural to compare the effect of rerooting M to Mi, against simply clamping Xi in the original model M. A key difference is that rerooting achieves the clamping at Xi for negligible computational cost. In contrast, if Xi is clamped in the original model then the inference method will have to be run twice: once clamping Xi = 0, and once clamping Xi = 1, then results must be combined. This is avoided with rerooting given the symmetry of M +. Rerooting effectively replaces what may be a poor initial implicit choice of clamping at X0 with a carefully selected choice of clamping variable almost for free. This is true even for large models where it may be advantageous to clamp a series of variables: by rerooting, one of the series is obtained for free, potentially gaining signiﬁcant beneﬁt with little work required. Note that each separate connected component may be handled independently, with its own added variable. This could be useful for (repeatedly) composing clamping and then rerooting each separated component to obtain an almost free clamping in each.

Acknowledgements

We thank Aldo Pacchiano for helpful discussions, and the anonymous reviewers for helpful comments. MR acknowledges support by the UK Engineering and Physical Sciences Research Council (EPSRC) grant EP/L016516/1 for the University of Cambridge Centre for Doctoral Training, the Cambridge Centre for Analysis. AW acknowledges support by the Alan Turing Institute under the EPSRC grant EP/N510129/1, and by the Leverhulme Trust via the CFI.

[1] F. Barahona, M. Grötschel, M. Jünger, and G. Reinelt. An application of combinatorial optimization to statistical physics and circuit layout design. Operations Research, 36(3):493 513, 1988.

[2] M. Deza and M. Laurent. Geometry of Cuts and Metrics. Springer Publishing Company, Incorporated, 1st edition, 1997. ISBN 978-3-642-04294-2.

[3] J. Djolonga and A. Krause. Scalable variational inference in log-supermodular models. In ICML, pages 1804 1813, 2015.

[4] T. Heskes, K. Albers, and B. Kappen. Approximate inference and constrained optimization. In UAI, pages 313 320, 2003.

[5] A. Jaimovich, G. Elidan, H. Margalit, and N. Friedman. Towards an integrated protein protein interaction network: A relational Markov network approach. Journal of Computational Biology, 13(2):145 164, 2006.

[6] P. Kohli, L. Ladicky, and P. Torr. Robust higher order potentials for enforcing label consistency. International Journal of Computer Vision, 82(3):302 324, 2009.

[7] V. Kolmogorov, J. Thapper, and S. Živný. The power of linear programming for general-valued CSPs. SIAM Journal on Computing, 44(1):1 36, 2015.

[8] J. Mooij. lib DAI: A free and open source C++ library for discrete approximate inference in graphical models. Journal of Machine Learning Research, 11:2169 2173, August 2010. URL http://www.jmlr. org/papers/volume11/mooij10a/mooij10a.pdf.

[9] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[10] M. Rowland, A. Pacchiano, and A. Weller. Conditions beyond treewidth for tightness of higher-order LP relaxations. In Artiﬁcal Intelligence and Statistics (AISTATS), 2017.

[11] H. Sherali and W. Adams. A hierarchy of relaxations between the continuous and convex hull representations for zero-one programming problems. SIAM Journal on Discrete Mathematics, 3(3):411 430, 1990.

[12] D. Sontag. Cutting plane algorithms for variational inference in graphical models. Master s thesis, MIT, EECS, 2007.

[13] D. Sontag and T. Jaakkola. New outer bounds on the marginal polytope. In NIPS, 2007.

[14] M. Wainwright and M. Jordan. Log-determinant relaxation for approximate inference in discrete Markov random ﬁelds. IEEE Transactions on Signal Processing, 2006.

[15] M. Wainwright and M. Jordan. Graphical models, exponential families and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1 305, 2008.

[16] M. Wainwright, T. Jaakkola, and A. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7):2313 2335, 2005.

[17] A. Weller. Revisiting the limits of MAP inference by MWSS on perfect graphs. In AISTATS, 2015.

[18] A. Weller. Characterizing tightness of LP relaxations by forbidding signed minors. In UAI, 2016.

[19] A. Weller. Uprooting and rerooting graphical models. In International Conference on Machine Learning (ICML), 2016.

[20] A. Weller and J. Domke. Clamping improves TRW and mean ﬁeld approximations. In Artiﬁcial Intelligence and Statistics (AISTATS), 2016.

[21] A. Weller and T. Jebara. Clamping variables and approximate inference. In Neural Information Processing Systems (NIPS), 2014.

[22] A. Weller, K. Tang, D. Sontag, and T. Jebara. Understanding the Bethe approximation: When and how can it go wrong? In Uncertainty in Artiﬁcial Intelligence (UAI), 2014.

[23] A. Weller, M. Rowland, and D. Sontag. Tightness of LP relaxations for almost balanced models. In Artiﬁcial Intelligence and Statistics (AISTATS), 2016.

[24] J. Yedidia, W. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. Information Theory, pages 2282 2312, 2005.