# Lifted Generalized Dual Decomposition

Nicholas Gallo, University of California Irvine, Irvine, CA 92637-3435, ngallo1@uci.edu
Alexander Ihler, University of California Irvine, Irvine, CA 92637-3435, ihler@ics.uci.edu

## Abstract

Many real-world problems, such as Markov Logic Networks (MLNs) with evidence, can be represented as a highly symmetric graphical model perturbed by additional potentials. In these models, variational inference approaches that exploit exact model symmetries are often forced to ground the entire problem, while methods that exploit approximate symmetries (such as by constructing an over-symmetric approximate model) offer no guarantees on solution quality. In this paper, we present a method based on a lifted variant of the generalized dual decomposition (GenDD) for marginal MAP inference, which provides a principled way to exploit symmetric sub-structures in a graphical model. We develop a coarse-to-fine inference procedure that provides any-time upper bounds on the objective. The upper bound property of GenDD gives a principled way to guide the refinement process, providing good any-time performance and eventually arriving at the ground optimal solution.

## Introduction

A central task in many application domains (Wainwright and Jordan 2008) is computing likelihoods and marginal probabilities over distributions defined by a graphical model. These tasks are, in general, intractable, which has motivated the development of many approximate inference techniques. Of significant theoretical and practical interest are variational techniques, which formulate the inference problem as a constrained optimization problem. Within this class, methods that provide upper bounds on the partition function and give rise to convex optimization problems are particularly attractive, since convergence can be guaranteed and the quality of two bounds can be easily compared. This class contains the tree-reweighted bounds (Wainwright, Jaakkola, and Willsky 2005) and primal variants such as GenDD (Ping, Liu, and Ihler 2015), which we study here.

Recently, there has been growing interest in the development of lifted inference techniques, both exact (Bui, Huynh, and de Salvo Braz 2012; Poole 2003) and approximate (Bui, Huynh, and Riedel 2012; Mladenov, Ahmadi, and Kersting 2012; Mladenov and Kersting 2015; Bui, Huynh, and Sontag 2014; Mladenov, Globerson, and Kersting 2014), which exploit model symmetries. Most of these works take a well-established variational approximation and identify exact problem symmetries either algorithmically (e.g., the redundant message computations of loopy belief propagation (LBP) (Singla and Domingos 2008; Kersting, Ahmadi, and Natarajan 2009)) or via graph-theoretic notions of symmetry (Bui, Huynh, and Riedel 2012; Mladenov, Ahmadi, and Kersting 2012; Mladenov and Kersting 2015; Bui, Huynh, and Sontag 2014; Mladenov, Globerson, and Kersting 2014). These methods, however, often have significant trouble with models possessing highly symmetric substructures but few exact symmetries. Some works (Venugopal and Gogate 2014; Van den Broeck and Darwiche 2013) address this by generating a symmetric approximation to an MLN (Richardson and Domingos 2006) with evidence, but offer no guarantees on the quality of the approximation.
(Broeck and Niepert 2014) corrects this bias by using an over-symmetric approximation to construct a proposal distribution for a Metropolis–Hastings chain, which guarantees convergence to the true marginals but gives no guarantees on mixing time. Other works approximate lifted BP by grouping messages with similar values (Kersting et al. 2010) or messages satisfying a desired structural property (Singla, Nath, and Domingos 2014). (Singla, Nath, and Domingos 2014) analyzes the errors the approximation induces on the true BP messages but, like BP, offers no guarantees on solution quality or convergence.

Similar in spirit to our work is (Habeeb et al. 2017), which sequentially solves a set of relaxed MAP inference problems for computer vision tasks. Their method operates in a coarse-to-fine manner, solving at each level a MAP problem in which groups of pixels are restricted to take the same value. This pixel grouping is then relaxed and the finer MAP problem is initialized with the solution to the coarser problem, guaranteeing monotonic improvement of the solution.

Our paper provides a framework for exploiting approximate model symmetries, based on a lifted variant of GenDD (Ping, Liu, and Ihler 2015). In contrast to works that create an over-symmetric model approximation, our method imposes a set of over-symmetric restrictions on the variational cost-shifting updates (messages) associated with the GenDD problem. Guided by a measure of solution quality, these constraints are gradually relaxed, producing a sequence of problems of increasing accuracy but increasing cost. Our experimental results show the superiority of the objective-based refinement criterion, and good anytime performance compared to methods that exploit exact symmetries.

## Background

A Markov random field (MRF) over $n$ discrete random variables $X = [X_1 \dots X_n]$ taking values $x = [x_1 \dots x_n] \in \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$ has probability density function

$$p(X = x; \theta) = \exp\Big( \sum_{\alpha \in F} \theta_\alpha(x_\alpha) - \Phi(\theta) \Big),$$

where $F$ represents the set of cliques of the distribution, each $\alpha \in F$ being a subset of the variables associated with a potential table $\theta_\alpha$, and $\Phi(\theta)$ is the log partition function, which normalizes the distribution. Furthermore, denote the set of variable indices as $V = \{1 \dots n\}$.

### Generalized dual decomposition

Motivated by the intractability of computing $\Phi(\theta)$, (Ping, Liu, and Ihler 2015) develops efficient upper bounds on $\Phi(\theta)$ via a decomposition based on Hölder's inequality. Their bound generalizes the dual decomposition bound (Globerson and Jaakkola 2008) for the MAP problem, maintaining two key properties: (1) it decomposes into a sum of terms defined over cliques of the graphical model, and (2) a family of bounds can be indexed by a set of variational parameters, where finding the parameter setting yielding the tightest bound can be formulated as a convex optimization problem. We review their main results.

For any non-negative function $f(x)$ and $w \in \mathbb{R}_{>0}$, the power sum operator is defined as

$$\sum_{x}^{w} f(x) \overset{\text{def}}{=} \Big( \sum_x f(x)^{1/w} \Big)^{w}.$$

The power sum reduces to the standard sum when $w = 1$ and approaches $\max_x f(x)$ as $w \to 0^+$. We also define the vector power sum

$$\sum_{\mathbf{x}}^{\mathbf{w}} F(\mathbf{x}) = \sum_{x_k}^{w_k} \cdots \sum_{x_1}^{w_1} F(\mathbf{x}),$$

where $\mathbf{w}$ is a vector of $k$ weights, $\mathbf{x}$ is a vector of $k$ random variables, and $F$ is a non-negative function of $k$ inputs.

The set of variational parameters $\delta = \{\delta^r_\alpha \mid \alpha \in F,\ r \in \mathbb{N}^{|\alpha|}_+\}$ and $w = \{w^r_\alpha \mid \alpha \in F,\ r \in \mathbb{N}^{|\alpha|}_+\}$, where $\mathbb{N}^i_+ = \{1 \dots i\}$ for any positive integer $i$, reparameterize the distribution to provide a tighter bound.
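As a concrete reference point before presenting the bound, here is a minimal numerically stable Python sketch of the power sum (our illustration, not code from the paper), checking both limiting cases:

```python
import numpy as np

def power_sum(f, w):
    """Power sum (sum_x f(x)**(1/w))**w of a non-negative vector f,
    evaluated in log space for numerical stability."""
    log_f = np.log(np.maximum(f, 1e-300))          # guard against log(0)
    z = log_f / w
    m = z.max()
    return float(np.exp(w * (m + np.log(np.exp(z - m).sum()))))

f = np.array([1.0, 2.0, 3.0])
assert np.isclose(power_sum(f, 1.0), f.sum())      # w = 1: ordinary sum
assert np.isclose(power_sum(f, 1e-6), f.max())     # w -> 0+: approaches max
```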
The bound of (Ping, Liu, and Ihler 2015) can be written as

$$\sum_{v \in V} u_v(\delta, w) + \sum_{\alpha \in F} c_\alpha(\delta, w) \overset{\text{def}}{=} L(\delta, w), \tag{1}$$

where, letting $w_\alpha = [w^1_\alpha \dots w^{|\alpha|}_\alpha]$, the clique terms are

$$c_\alpha(\delta, w) = \log \sum_{x_\alpha}^{w_\alpha} \exp\Big( \theta_\alpha(x_\alpha) - \sum_{r=1}^{|\alpha|} \delta^r_\alpha(x_{\alpha_r}) \Big) \tag{2}$$

and the unary terms are

$$u_v(\delta, w) = \log \sum_{x_v}^{w^0_v} \exp(\delta^0_v(x_v)), \tag{3}$$

with $\delta^0_v(x_v) = \sum_{(\alpha,r) \in N_v} \delta^r_\alpha(x_v)$ and $w^\Delta_v = 1 - \sum_{(\alpha,r) \in N_v} w^r_\alpha$, where $N_v = \{(\alpha, r) \mid \alpha_r = v\}$ are the neighbors of $v$. Furthermore, we define $w^0_v = \max(0, w^\Delta_v)$ and require $w^r_\alpha \geq 0$ for all $(\alpha, r)$.[^1] We also require that the local elimination order of each clique power sum term (2) be consistent with a global elimination order $o$, meaning that for all $i < j$, $o(\alpha_i) < o(\alpha_j)$. If this is violated for an $\alpha$ in the input, that $\alpha$ and its associated potential are permuted to obey this order.

[^1]: In (Ping, Liu, and Ihler 2015), $w^0_v$ is a free variable with restrictions $w^0_v \geq 0$ and $w^0_v + \sum_{(\alpha,r) \in N_v} w^r_\alpha = 1$. In our formulation, this sum is always larger than or equal to 1 (providing a valid bound), with equality holding at the optimum.

*Figure 1: Position-explicit factor graph associated with $\alpha_1 = (x_1, x_2)$, $\alpha_2 = (x_2, x_3)$. Factor cells $(\alpha_i, r)$ are associated with parameters $(\delta^r_{\alpha_i}, w^r_{\alpha_i})$, and unary terms are sums of neighboring cost-shifting variables: $\delta^0_1 = \delta^1_{\alpha_1}$, $\delta^0_2 = \delta^2_{\alpha_1} + \delta^1_{\alpha_2}$, $\delta^0_3 = \delta^2_{\alpha_2}$ (similarly for the weights).*
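The clique and unary terms (2)–(3) are nested power sums in log space. Below is a minimal sketch of evaluating them, assuming dense numpy tables, scipy available, and the first-position-first elimination convention of the vector power sum above; the helper names are ours, not the authors':

```python
import numpy as np
from scipy.special import logsumexp

def log_power_sum(log_f, w, axis=0):
    """log of the power sum of exp(log_f) along one axis, with weight w > 0."""
    return w * logsumexp(log_f / w, axis=axis)

def clique_term(theta, deltas, weights):
    """c_alpha of Eq. (2): theta is an |alpha|-dim log-potential table;
    deltas[r], weights[r] are the cost-shift vector and weight of position r."""
    log_f = theta.astype(float)
    for r, d in enumerate(deltas):              # theta - sum_r delta_r
        shape = [1] * log_f.ndim
        shape[r] = -1
        log_f = log_f - np.reshape(d, shape)
    for w in weights:                           # sequential power sums; each
        log_f = log_power_sum(log_f, w, axis=0) # eliminated axis drops out
    return float(log_f)

def unary_term(delta0, w0):
    """u_v of Eq. (3): log power sum of exp(delta0) with weight w0 >= 0."""
    return float(log_power_sum(np.asarray(delta0, float), max(w0, 1e-8)))
```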
### Ground factor graphs

In order to facilitate development of our lifted inference procedure, we modify the standard factor graph depiction of graphical models to make each variable's position in its participating factors explicit. To this end, we draw a circular node associated with each variable $v$ and a rectangle partitioned into $|\alpha|$ cells associated with each factor $\alpha$. Position cell $(\alpha, r)$ is associated with terms $(\delta^r_\alpha, w^r_\alpha)$ (labeled with just $\delta^r_\alpha$ for compactness) and has an edge to its variable neighbor $v = \alpha_r$, which is associated with terms $\delta^0_v, w^\Delta_v, u_v$ (labeled with just $\delta^0_v$). A simple example is illustrated in Figure 1.[^2]

[^2]: (Mladenov and Kersting 2015) used a similar representation where each position cell is its own node, connected to its factor node. For compactness, we prefer the notation presented here.

## Exploiting symmetries

Lifted inference techniques identify a set of symmetries in the ground variational problem that can be exploited computationally. Exact lifted variational inference (Singla and Domingos 2008; Mladenov, Globerson, and Kersting 2014; Mladenov and Kersting 2015; Mladenov, Ahmadi, and Kersting 2012) identifies a stable partition (Berkholz, Bonsma, and Grohe 2013) of the graph as sufficient to characterize symmetries in the solution to the ground variational problem. In this paper, we instead impose a set of over-symmetric constraints on the ground variational parameters $(\delta, w)$, inducing symmetries in the objective (1) that will, in general, be coarser than those implied by a stable partition, yielding looser but computationally cheaper bounds.

### Parameter and objective symmetries

Consider a disjoint partition $P_F = \{F_1 \dots F_{\bar F}\}$ of the ground factors $F$ ($F_i \cap F_j = \emptyset$ for $i \neq j$, $\bigcup_f F_f = F$) where all factors in the same partition are required to have the same potential function. That is, for each $f$ there exists a representative factor $\tilde\theta_f$ (with associated scope size $\tilde r_f$ and domains $\tilde{\mathcal{X}}^r_f$ for $r \in \mathbb{N}^{\tilde r_f}_+$) such that for all $\alpha \in F_f$, $\theta_\alpha = \tilde\theta_f$ (and $|\alpha| = \tilde r_f$ and $\mathcal{X}_{\alpha_r} = \tilde{\mathcal{X}}^r_f$ for $r \in \mathbb{N}^{\tilde r_f}_+$). In this paper, we consider over-symmetric constraints on the variational parameters that force terms associated with factors in the same partition to be identical. That is, we consider parameters in the set

$$S(P_F) = \Big\{ (\delta, w) \;\Big|\; \exists (\tilde\delta, \tilde w)\ \forall F_f \in P_F,\ \forall\alpha \in F_f,\ \forall r \in \mathbb{N}^{|\alpha|}_+ :\ \delta^r_\alpha(\tilde x^r_f) = \tilde\delta^r_f(\tilde x^r_f),\ w^r_\alpha = \tilde w^r_f \Big\},$$

where $\tilde\delta = \{\tilde\delta^r_f \mid F_f \in P_F,\ r \in \mathbb{N}^{\tilde r_f}_+\}$ and $\tilde w = \{\tilde w^r_f \mid F_f \in P_F,\ r \in \mathbb{N}^{\tilde r_f}_+\}$ are restricted sets of parameters.

For any $(\delta, w) \in S(P_F)$, we identify symmetries in the clique terms (2) as $c_\alpha(\delta, w) = \tilde c_f(\tilde\delta, \tilde w)$ for all $\alpha \in F_f$, where

$$\tilde c_f(\tilde\delta, \tilde w) \overset{\text{def}}{=} \log \sum_{\tilde x}^{\tilde w_f} \exp\Big( \tilde\theta_f(\tilde x) - \sum_{r=1}^{\tilde r_f} \tilde\delta^r_f(\tilde x^r_f) \Big) \tag{4}$$

and $\tilde w_f = [\tilde w^1_f \dots \tilde w^{\tilde r_f}_f]$ and $\tilde x = [\tilde x^1_f \dots \tilde x^{\tilde r_f}_f]$.

To recognize the symmetries induced in the aggregated cost-shifting terms $\delta^0_v, w^0_v$ (and hence in $u_v$ (3)), first let $N^{(f,r)}_v = \{\alpha \mid \alpha \in F_f,\ (\alpha, r) \in N_v\}$ represent the neighbors of $v$ that contribute $\delta^r_\alpha(\tilde x^r_f) = \tilde\delta^r_f(\tilde x^r_f)$ to the sum. There are $M^{(f,r)}_v = |N^{(f,r)}_v|$ such neighbors. Furthermore, define a disjoint partition $P_V = \{V_1 \dots V_K\}$ of $V$ where all variables in the same partition have the same domain ($\mathcal{X}_v = \tilde{\mathcal{X}}_k$ for all $v \in V_k$) and the same neighbor counts given $P_F$. That is, there exist representative counts $\tilde M^{(f,r)}_k$ such that $M^{(f,r)}_v = \tilde M^{(f,r)}_k$ for all $v \in V_k$ and all $(f, r)$. Now, for all $v \in V_k$ we have

$$\delta^0_v(\tilde x_k) = \tilde\delta^0_k(\tilde x_k) \overset{\text{def}}{=} \sum_{(f,r) \in \tilde N_k} \tilde M^{(f,r)}_k \tilde\delta^r_f(\tilde x_k) \tag{5}$$

for all values $\tilde x_k$, where $\tilde N_k = \{(f, r) \mid \tilde M^{(f,r)}_k > 0\}$ (and similarly, $w^\Delta_v = \tilde w^\Delta_k \overset{\text{def}}{=} 1 - \sum_{(f,r) \in \tilde N_k} \tilde M^{(f,r)}_k \tilde w^r_f$). These symmetries imply that $u_v(\delta, w) = \tilde u_k(\tilde\delta, \tilde w)$ for all $v \in V_k$.

Our objective can now be evaluated over the smaller set of representative parameters $(\tilde\delta, \tilde w)$, with symmetries fully specified by $P = (P_F, P_V)$. We say that such a $P$ is valid, and we have $L(\delta, w) = \tilde L(\tilde\delta, \tilde w)$, where

$$\tilde L(\tilde\delta, \tilde w) \overset{\text{def}}{=} \sum_{F_f \in P_F} |F_f|\, \tilde c_f(\tilde\delta, \tilde w) + \sum_{V_k \in P_V} |V_k|\, \tilde u_k(\tilde\delta, \tilde w). \tag{6}$$

### Lifted factor graphs

Associated with any valid partition $P$ is a lifted graph, which compactly illustrates the symmetries in the variational parameters and problem terms (1). A lifted factor graph has a super-node associated with each $V_k$ and a super-factor with $\tilde r_f$ position cells associated with each $F_f$. Position cell $(f, r)$ is associated with $\tilde\delta^r_f, \tilde w^r_f$ (labeled with just $\tilde\delta^r_f$) and has a super-edge labeled $\tilde M^{(f,r)}_k$ to super-node $k$ if $\tilde M^{(f,r)}_k > 0$. The term $\tilde N_k$ defined after (5) can be seen as the neighborhood of $k$ in the lifted graph. We similarly define the neighborhood of $(f, r)$ as $\tilde N^{(f,r)} = \{k \mid \tilde M^{(f,r)}_k > 0\}$.

**Example 1.** Consider a model with factor graph depicted in Figure 2a (with elimination order $o = (R_1, R_2, T_1, T_2)$) and clique symmetries $\theta_{\alpha_2} = \theta_{\alpha_3} = \theta_{\alpha_4} = \theta_{\alpha_5} = \theta$. Associated with the graph is the partition $P_F = \{F_g, F_y\}$ where $F_g = \{\alpha_1\}$ (green), $F_y = \{\alpha_2, \alpha_3, \alpha_4, \alpha_5\}$ (yellow), and $P_V = \{V_p, V_r, V_b\}$ where $V_p = \{R_1\}$ (purple), $V_r = \{R_2\}$ (red), and $V_b = \{T_1, T_2\}$ (blue).[^3] The top half of the middle panel indicates the over-symmetric parameter constraints specified by $P_F$. The bottom half of that panel indicates the induced symmetries in the unary terms. For example, by examining the neighborhood of $R_1$ in the factor graph, we see that $\delta^0_{R_1} = \delta^1_{\alpha_2} + \delta^1_{\alpha_3} + \delta^1_{\alpha_1}$. Via the parameter symmetries $\delta^1_{\alpha_2} = \delta^1_{\alpha_3} = \tilde\delta^1_y$ and $\delta^1_{\alpha_1} = \tilde\delta^1_g$, we see that this is equivalent to $\delta^0_{R_1} = 2\tilde\delta^1_y + \tilde\delta^1_g$.

[^3]: The main text used integers to index factor and variable partitions; here we use letters, representing figure colors, for clarity.
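Evaluating (6) touches only one representative term per partition, scaled by the partition size. A sketch of this bookkeeping, reusing `clique_term` and `unary_term` from the earlier sketch; the tuple-based data layout is our own assumption, not the authors':

```python
def lifted_objective(factor_parts, var_parts):
    """Evaluate L-tilde of Eq. (6).
    factor_parts: list of (size |F_f|, theta_f, deltas_f, weights_f)
    var_parts:    list of (size |V_k|, [(M_k^(f,r), delta_f^r, w_f^r), ...])"""
    bound = 0.0
    for size, theta, deltas, weights in factor_parts:
        bound += size * clique_term(theta, deltas, weights)
    for size, lifted_edges in var_parts:
        delta0 = sum(M * d for M, d, _ in lifted_edges)        # Eq. (5)
        w_delta = 1.0 - sum(M * w for M, _, w in lifted_edges)
        bound += size * unary_term(delta0, max(0.0, w_delta))  # w0 = max(0, wD)
    return bound
```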
### Gradient symmetries

Just as the over-symmetric parameter constraints induce symmetries in the ground objective function, they also induce (a different set of) symmetries in the gradient of the ground objective. These gradient symmetries will be used to derive the gradient of the lifted problem, to identify exact symmetries in the ground problem, and to guide our coarse-to-fine inference procedure.

First, notice that the parameter symmetries imply not only symmetries in the values of the clique (4) and unary (5) terms, but also in their gradients. That is, for any $(\delta, w) \in S(P_F)$, we have

$$\frac{\partial c_\alpha}{\partial \delta^r_\alpha(\tilde x^r_f)} = \frac{\partial \tilde c_f}{\partial \tilde\delta^r_f(\tilde x^r_f)} \quad\text{and}\quad \frac{\partial u_v}{\partial \delta^0_v(\tilde x_k)} = \frac{\partial \tilde u_k}{\partial \tilde\delta^0_k(\tilde x_k)}$$

for any $\alpha \in F_f$, $v \in V_k$, and for all values of $\tilde x^r_f$ and $\tilde x_k$.

Now, consider the set $E^{(f,r)}_k = \bigcup_{v \in V_k} N^{(f,r)}_v$ of ground factors $\alpha \in F_f$ whose $r$-th position cell $(\alpha, r)$ has neighbor $v = \alpha_r \in V_k$. For any $\alpha \in E^{(f,r)}_k$, the parameter $\delta^r_\alpha$ appears only in clique term $c_\alpha$ in position $r$ (multiplied by $-1$) and in unary term $u_v$ (multiplied by $+1$). We can therefore identify symmetries in the objective gradient as

$$\frac{\partial L}{\partial \delta^r_\alpha(\tilde x_k)} = \frac{\partial u_v}{\partial \delta^0_v(\tilde x_k)} + \frac{\partial c_\alpha}{\partial \delta^r_\alpha(\tilde x_k)} = \frac{\partial \tilde u_k}{\partial \tilde\delta^0_k(\tilde x_k)} + \frac{\partial \tilde c_f}{\partial \tilde\delta^r_f(\tilde x_k)} \overset{\text{def}}{=} g^{(f,r)}_{k;\delta}(\tilde x_k) \tag{7}$$

for all values of $\tilde x_k$ (and similarly for the $w$ gradients).

*Figure 2: (a) A valid, non-stable partition $P$ imposes over-symmetric constraints on the variational parameters of the ground problem, yielding a looser bound. (b) A stable partition $P$, which characterizes exact symmetries in the variational parameters and yields an exact solution to the ground problem. More details in Examples 1 and 2 of the text.*

### Lifted gradients and ground optimality

The gradients of the lifted objective (6) can be written using the gradient symmetries of the ground problem. Specifying the lifted gradients in this way allows us to identify partitions whose symmetric parameter constraints specify an exact solution to the ground problem. Since, via (7), the ground gradients $\partial L / \partial \delta^r_\alpha(\tilde x^r_f)$ depend on $\alpha \in F_f$ only through the super-node $k$ containing $\alpha_r$ (for all values of $\tilde x^r_f$), we can write

$$\frac{\partial \tilde L}{\partial \tilde\delta^r_f(\tilde x^r_f)} = \sum_{\alpha \in F_f} \frac{\partial L}{\partial \delta^r_\alpha(\tilde x^r_f)} = \sum_{k \in \tilde N^{(f,r)}} |E^{(f,r)}_k|\; g^{(f,r)}_{k;\delta}(\tilde x^r_f) \tag{8}$$

where $|E^{(f,r)}_k| = \tilde M^{(f,r)}_k |V_k|$ (and similarly for the $w$ gradients). In the special case where $(f, r)$ has one lifted neighbor $\tilde N^{(f,r)} = \{k\}$, we have $F_f = E^{(f,r)}_k$ and (8) simplifies to $\partial \tilde L / \partial \tilde\delta^r_f(\tilde x^r_f) = |F_f|\, \partial L / \partial \delta^r_\alpha(\tilde x^r_f)$, meaning that $\partial L / \partial \delta^r_\alpha(\tilde x^r_f) = 0$ when $\partial \tilde L / \partial \tilde\delta^r_f(\tilde x^r_f) = 0$. Now, if all of the gradients of the lifted problem are zero, and its lifted graph has $|\tilde N^{(f,r)}| = 1$ for all $(f, r)$, then all of the ground gradients are also zero. This means that a solution of the lifted problem provides a solution to the ground problem. A partition $P$ whose lifted graph satisfies the property that $|\tilde N^{(f,r)}| = 1$ for all $(f, r)$ is called stable.

**Example 2.** If we replace $P_F$ from Example 1 with $P_F = \{F_g, F_y, F_o\}$ where $F_g = \{\alpha_1\}$ (green), $F_y = \{\alpha_2, \alpha_3\}$ (yellow), and $F_o = \{\alpha_4, \alpha_5\}$ (orange), then the partition is stable and $S(P_F)$ contains a solution to the ground variational problem. This is illustrated in Figure 2b.

### Optimizing the lifted objective

In this paper, we use a black-box procedure to perform global optimization utilizing the gradient of $\tilde L$. To ensure positivity of the weights, we define the parameters $\tilde\eta = \{\tilde\eta^r_f \mid F_f \in P_F,\ r \in \mathbb{N}^{\tilde r_f}_+\}$, substituting $\tilde w^r_f$ with $\epsilon + \exp(\tilde\eta^r_f)$ (with $\partial \tilde L / \partial \tilde\eta^r_f = (\partial \tilde L / \partial \tilde w^r_f) \exp(\tilde\eta^r_f)$). To ensure that the function is differentiable, we set $\tilde w^0_k = a(\tilde w^\Delta_k)$ where $a(x) = \epsilon + t \log(1 + \exp(x/t))$.[^4] This enables unconstrained optimization over the $\tilde\eta$ and $\tilde\delta$ parameters. The gradients are straightforward to derive from the gradients in (Ping, Liu, and Ihler 2015) and are omitted for space.

[^4]: Any $t > 0$, $\epsilon > 0$ bound the objective, since this generates larger unary weights: $a(w^\Delta_v) > \epsilon + \max(0, w^\Delta_v) > \max(0, w^\Delta_v)$.
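A small sketch of this reparameterization (our illustration; `EPS` and `T` use the values reported in the experiments section):

```python
import numpy as np

EPS, T = 1e-3, 1e-2     # epsilon and t as set in the paper's experiments

def w_from_eta(eta):
    """w = eps + exp(eta): keeps factor weights strictly positive while eta
    ranges over all reals, enabling unconstrained optimization."""
    return EPS + np.exp(eta)

def a(x, eps=EPS, t=T):
    """a(x) = eps + t*log(1 + exp(x/t)): a differentiable strict upper bound
    on max(0, x), written in the numerically stable softplus form."""
    z = np.asarray(x, dtype=float) / t
    return eps + t * (np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z))))

assert a(0.5) > max(0.0, 0.5)    # a(x) > max(0, x), so the bound stays valid
assert a(-0.5) > 0.0
```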
## Coarse-to-fine approximation

We now develop a coarse-to-fine procedure that interleaves relaxations of the over-symmetric parameter constraints $S(P_F)$ with optimization, eventually arriving at the coarsest stable partition $P$ and a solution to the ground problem. While optimization touches only the lifted graph, relaxation steps must touch a subset of the ground factors and variables. To control this cost, we adapt to our setting the key ideas of (Berkholz, Bonsma, and Grohe 2013), which presented an asymptotically optimal algorithm for generating the coarsest stable partition of a graph. This procedure is used as a pre-processing step in works on exact lifted variational inference such as (Mladenov, Globerson, and Kersting 2014).

```
Algorithm 1  Syntactic coarsest stable partition
 1: Set P_V to group variables with identical domains X_v
 2: Set F0_f, f = 1..Fbar0, to group factors with identical potentials theta
 3: Q <- {}
 4: for f = 1 to Fbar0 do
 5:     F_f <- F0_f ;  refine_PV(0)
 6: end for
 7: while Q != {} do
 8:     Choose any (f, r) in Q;  Q <- Q \ {(f, r)}
 9:     Choose a k in N~^(f,r) with |E_k^(f,r)| <= |F_f| / 2
10:     Fbar <- Fbar + 1
11:     F_Fbar <- E_k^(f,r) ;  F_f <- F_f \ F_Fbar
12:     refine_PV(f)
13: end while
```

### Syntactic coarsest stable partition

Throughout our description, we maintain a valid partition and its associated lifted graph. Recall from the subsection on lifted gradients (and Figure 2b) that a valid partition is stable if and only if each position cell in its lifted graph has exactly one neighbor. Algorithm 1 maintains the set $Q = \{(f, r) \mid |\tilde N^{(f,r)}| \geq 2\}$ of position cells that violate this property. The algorithm iteratively selects an $(f, r) \in Q$ and one of its neighbors $k \in \tilde N^{(f,r)}$. Ground factors in $F_f$ are then split into two groups (line 11), based on whether their $r$-th position neighbor is a variable in $V_k$ (recall from the section on gradient symmetries that this set is $E^{(f,r)}_k$) or not. In the appendix, we define the function refine_PV, which updates the partition $P_V$ to be consistent with the new neighbor counts $M^{f}_v$ and $M^{\bar F}_v$ for all $v$, updating the lifted graph and $Q$ as needed. The restriction of choosing $k$ such that $|E^{(f,r)}_k| \leq \frac{1}{2}|F_f|$ (line 9) is necessary to maintain the complexity results of (Berkholz, Bonsma, and Grohe 2013), as discussed in the appendix.
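For illustration, a minimal Python sketch of the split in line 11 of Algorithm 1, assuming a simple dict-based representation; all names here are our own:

```python
from collections import defaultdict

def split_factor_group(F_f, r, var_class):
    """Split factor group F_f on the class of each factor's r-th argument.
    var_class maps a variable id to its current partition (super-node) id.
    Returns (kept, moved) with |moved| <= |F_f|/2, the size restriction that
    preserves the O(|F| log |F|) guarantee."""
    by_class = defaultdict(list)
    for alpha in F_f:                    # alpha: tuple of variable ids (scope)
        by_class[var_class[alpha[r]]].append(alpha)
    if len(by_class) < 2:
        return list(F_f), []             # position cell (f, r) already stable
    k = min(by_class, key=lambda c: len(by_class[c]))   # a smallest class
    moved = by_class.pop(k)              # this is E_k^(f,r)
    kept = [a for group in by_class.values() for a in group]
    return kept, moved

# e.g. F_f = [(0, 1), (0, 2), (0, 3)] with var_class = {1: 'A', 2: 'A', 3: 'B'}
# splits off the singleton class: moved = [(0, 3)].
```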
### Coarse-to-fine algorithm

One simple way to construct a coarse-to-fine inference procedure is to interrupt the main loop of Algorithm 1 to perform lifted inference with relaxations and symmetries specified by the current partition $P$. This would yield any-time bounds and ensure that we never produce a model finer than the coarsest stable partition. In practice, however, superior results are obtained by choosing refinement operations based on their predicted decrease of the objective value. Algorithm 2 illustrates this idea, where $S^r_f$ (detailed in the next section) judges the quality of a split of $(f, r)$, cost($P$) is the estimated cost of one inference iteration (we count the number of operations used to compute the lifted score (6)), and $\beta$ controls how much refinement we allow between inference calls.

```
Algorithm 2  Coarse-to-fine inference
 1: Execute lines 1 to 6 of Algorithm 1              // initialize P
 2: Psi <- 0
 3: repeat
 4:     if cost(P) > Psi then
 5:         [delta~, w~] <- inference(P, delta~, w~)
 6:         Set S_f^r via (9) for all (f, r) in Q
 7:         Psi <- cost(P) * beta
 8:     end if
 9:     (f*, r*) <- argmin_{(f,r) in Q} S_f^r
10:     Set N~_1 subset of N~^(f*,r*) as the argmin of S_f*^r* via (9)
11:     Fbar <- Fbar + 1
12:     F_Fbar <- Union_{k in N~_1} E_k^(f*,r*) ;  F_f* <- F_f* \ F_Fbar
13:     refine_PV(f*)
14:     Set S_f^r via (9) for f in {f*, Fbar}, r = 1..r~_f
15: until P is stable
16: [delta~, w~] <- inference(P, delta~, w~)         // stable-P inference
```

### Objective-based scoring function

We first define a metric that measures the quality of a proposed parameter partition; we then show how to optimize this metric efficiently, using only information contained in the lifted graph. For the set of ground parameters associated with an $(f, r)$, we measure the error induced by the current parameter-tying restrictions as the distance of the ground gradients to their projected gradient. Letting $g^r_\alpha$ be the set of ground gradients associated with an $(\alpha, r)$, this metric is

$$d(F_f, r) = \sum_{\alpha \in F_f} \|g^r_\alpha - \gamma\|^2_2, \quad\text{with}\quad \gamma = \Big(\sum_{\alpha \in F_f} g^r_\alpha\Big) \Big/ |F_f|.$$

We would like to find a split of $F_f$ into $F_{f_1}$ and $F_{f_2}$ that minimizes $d(F_{f_1}, r) + d(F_{f_2}, r) - d(F_f, r)$, which is the change in the projected gradient distance. By symmetry, this split groups together factors with the same $g^r_\alpha$. Recalling from (7) that $g^r_\alpha = g^{(f,r)}_k$ for all $\alpha \in E^{(f,r)}_k$, we write this grouping as $F_{f_i} = \bigcup_{k \in \tilde N_i} E^{(f,r)}_k$, where $\tilde N_1 \subset \tilde N^{(f,r)}$ and $\tilde N_2 = \tilde N^{(f,r)} \setminus \tilde N_1$ (adopting the convention $|F_{f_1}| \leq |F_{f_2}|$).

We can now form a compact optimization problem to find the best split. For any $\tilde N \subseteq \tilde N^{(f,r)}$ we have $d\big(\bigcup_{k \in \tilde N} E^{(f,r)}_k, r\big) = \tilde d^r_f(\tilde N)$, where

$$\tilde d^r_f(\tilde N) = \sum_{k \in \tilde N} |E^{(f,r)}_k|\, \big\|g^{(f,r)}_k - \gamma\big\|^2_2, \quad\text{with}\quad \gamma = \Big(\sum_{k \in \tilde N} |E^{(f,r)}_k|\, g^{(f,r)}_k\Big) \Big/ \Big(\sum_{k \in \tilde N} |E^{(f,r)}_k|\Big).$$

Letting $\mathcal{P}(\tilde N^{(f,r)})$ be the power set (all subsets) of $\tilde N^{(f,r)}$, we solve

$$S^r_f = \min_{\tilde N_1 \in \mathcal{P}(\tilde N^{(f,r)})} \sum_{i=1}^{2} \tilde d^r_f(\tilde N_i) \;-\; \tilde d^r_f(\tilde N^{(f,r)}). \tag{9}$$

This is a weighted K-means (K = 2) problem with data $g^{(f,r)}_k$ and weights $|E^{(f,r)}_k|$. If we approximate this by using only one of the elements of $g^{(f,r)}_k$ (for example, only the gradient with respect to $\delta^{(f,r)}_k(1)$, as we use on the binary problems in this paper's experiments), the optimum can be found exactly by sorting the $g^{(f,r)}_k$ and looping over all splits.
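A sketch of this sorted scan for the scalar case (our illustration; it assumes at least two lifted neighbors, and relies on the standard fact that an optimal 1-D 2-clustering is contiguous in sorted order):

```python
import numpy as np

def best_split_1d(g, m):
    """Exact weighted 2-means in 1-D, scoring one position cell as in Eq. (9).
    g[i]: scalar gradient summary g_k for lifted neighbor k
    m[i]: weight |E_k^(f,r)|
    Returns (score S_f^r, indices forming one side of the split N_1)."""
    order = np.argsort(g)
    g = np.asarray(g, float)[order]
    m = np.asarray(m, float)[order]

    def wss(gs, ms):                     # weighted within-group sum of squares
        mu = (ms * gs).sum() / ms.sum()
        return float((ms * (gs - mu) ** 2).sum())

    score, cut = min((wss(g[:i], m[:i]) + wss(g[i:], m[i:]), i)
                     for i in range(1, len(g)))
    return score - wss(g, m), set(order[:cut])

# Identical g values give score 0 (nothing to gain); well-separated groups give
# a large negative score, flagging (f, r) as a good split candidate.
print(best_split_1d([0.0, 0.01, 5.0], [4, 4, 2]))
```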
*Figure 3: Test models with distinct soft evidence ($u$ and $l$ terms).*

(a) Complete Graph

| Weight | Formula |
|---|---|
| $b$ | $\neg(x = y) \wedge V(x) \wedge V(y)$ |
| $u_x$ | $V(x)$ |

(b) Binary Collective Classification

| Weight | Formula |
|---|---|
| $b$ | $\neg(x = y) \wedge L(x, y) \wedge (C(y) \Leftrightarrow C(x))$ |
| $u_x$ | $C(x)$ |
| $l_{xy}$ | $\neg(x = y) \wedge L(x, y)$ |

(c) Clique-Cycle

| Weight | Formula |
|---|---|
| $b$ | $\neg(x = y) \wedge Q_1(x) \wedge Q_2(y)$ |
| $b$ | $\neg(x = y) \wedge Q_2(x) \wedge Q_3(y)$ |
| $b$ | $\neg(x = y) \wedge Q_3(x) \wedge Q_1(y)$ |
| $u^i_x$ | $Q_i(x),\ i \in \{1, 2, 3\}$ |

*Figure 4: Comparison of optimization at the stable coloring (exact model symmetries) with our coarse-to-fine inference framework for three different splitting methods ("C2F - Objective", "C2F - Size", "C2F - Syntactic", vs. "Stable"), on each of the test models with domain size d = 160 for the logical variables. Panels (a) Complete, (b) Collective Classification, and (c) Clique Cycle plot upper bound vs. inference time (seconds, log scale). Vertical black dashed lines indicate coarse-to-fine transitions for objective-based splitting (the "C2F - Objective" curve); transitions for the other C2F curves occur at approximately the same positions.*

## Experiments

This section provides an empirical illustration of our lifted GenDD algorithm. We demonstrate the superiority of our objective-based approach over purely syntactic approaches for generating a coarse-to-fine approximation sequence. We also demonstrate excellent any-time performance on MLNs with distinct soft evidence on every node, compared to exact lifted variational approaches, which are forced to ground the entire problem.

### Datasets and methodology

**Datasets.** Figure 3 shows the models we use (similar models without evidence were used in (Mladenov and Kersting 2015; Bui, Huynh, and Sontag 2014) to evaluate performance on marginalization tasks). In all models, $b$ was set to 2.5 and the $u$ terms specify random evidence uniformly distributed on $[-\zeta, \zeta]$ where $\zeta = \log((1 - 10^{-2})/10^{-2})$, so that the $u$ terms act like probabilities between 0.01 and 0.99. For the collective classification problem, we also choose a random 10% of the $(x, y)$ values associated with links $L(x, y)$ and set $l_{xy}$ to a random value on $[0, 5.0]$. This can be interpreted as the strength of a link between two web pages.

**Parameter settings.** In our experiments, we set $\epsilon = 10^{-3}$, $t = 10^{-2}$ (in the definition of $a(x)$) and perform L-BFGS black-box optimization with a rank-20 Hessian correction. For the coarse-to-fine procedure, it is important to balance performing inference with model refinement. We found it worked well to perform a small number of inference iterations (30), followed by a small amount of model refinement (setting $\beta = 1.25$ in Algorithm 2). This worked better than trying to diagnose inference convergence to determine the transition point. Future work will look at more principled methods of balancing inference work with refinement.

**Timing methodology.** In our experiments, we measure only the cost of performing inference, ignoring the cost of identifying symmetries. This is common in experiments on approximate lifted inference in other works (Mladenov, Globerson, and Kersting 2014; Broeck and Niepert 2014; Venugopal and Gogate 2014), as well as in exact lifted variational inference approaches (Bui, Huynh, and Riedel 2012; Bui, Huynh, and Sontag 2014; Mladenov and Kersting 2015), which assume problem symmetries are available before inference. Works such as (Singla, Nath, and Domingos 2014; Kersting et al. 2010) note that identifying (approximate) symmetries can be a bottleneck. (Singla, Nath, and Domingos 2014) used a sparse hypercube-based representation to specify approximate symmetries in MLNs; adapting lifted GenDD to leverage a similar representation is an interesting direction for future work.

### Results

Figure 4 compares our coarse-to-fine procedure against performing inference on the stable coloring (the ground model in our setup) on each of our three test models. The blue curve ("C2F - Objective") performs objective-based splitting as per Algorithm 2, using the metric (9) to guide the splitting choice. We also show syntactic splitting ("C2F - Syntactic", orange) as per Algorithm 1, interleaving inference when the predicted inference cost exceeds a cost bound (then multiplying that bound by $\beta$ as in Algorithm 2). "C2F - Size" (red) splits the largest super-factor in $Q$ and chooses a random split that is as even as possible (adding random edges from $\tilde N^{(f,r)}$ to $\tilde N_2$, stopping when $\sum_{k \in \tilde N_2} |E^{(f,r)}_k| > \frac{1}{2}|F_f|$ and setting $\tilde N_1 = \tilde N^{(f,r)} \setminus \tilde N_2$ in line 10 of Algorithm 2), interleaving inference with refinement as in the other C2F methods. The stable (here, ground) inference process is shown in green.
We see that objective-based refinement significantly outperforms the others, with size-based refinement worse, but still better than syntactic splitting. Objective-based splitting provides orders-of-magnitude speed-up over the stable coloring for similar-quality results, and all methods provide better early performance than the stable coloring, which must touch the entire model to even provide an initial bound.

Figure 5 demonstrates the scalability of our approach. Plots (a)-(e) show the running time for our coarse-to-fine method ("C2F - Objective", blue) and the stable coloring method on instances of the complete graph, varying its size.

*Figure 5: Scalability comparison. Panels (a)-(e) show upper bound vs. inference time (seconds, log scale) on instances of the complete graph with evidence, varying the logical-variable domain size d in the MLN specification (panels (d) and (e) show d = 160 and d = 320). C2F exhibits an increasing advantage as the model grows larger.*

## Conclusions

We presented a framework for lifting GenDD for approximate inference by imposing over-symmetric constraints on the optimization parameters to induce symmetry in the variational bound. We developed a coarse-to-fine procedure that, guided by the objective, provides high-quality any-time approximations even in models with no exact symmetry. Our method provides orders-of-magnitude speed-up on our benchmarks, with increasing advantage for larger models.

## Acknowledgements

This work was sponsored in part by NSF grant IIS-1254071.

## Appendix

**Description of refine_PV.** After creating a new partition $F_{\bar F}$ (possibly by splitting it off from $F_f$), Algorithms 1 and 2 call refine_PV to update $P_V$ such that variables having identical neighbor counts $M_v = [M^1_v, \dots, M^{\bar F}_v]$ are grouped together (where $M^{f'}_v \overset{\text{def}}{=} [M^{(f',1)}_v \dots M^{(f',\tilde r_{f'})}_v]$ for all $f'$). We assume that at entry, the neighbor counts $M^{f'}_v$ are correct for all $v$ and $f' = 1 \dots \bar F - 1$, and that $P_V$ correctly groups variables by these counts, i.e., $M^{f'}_v = \tilde M^{f'}_k$ for all $v \in V_k$.

```
Algorithm 3  refine_PV
Input: f   (if f = 0, F_Fbar was deleted from F_f before the call)
 1: for r = 1 to r~_Fbar do
 2:     for alpha in F_Fbar do                  // update ground neighbor lists
 3:         v <- alpha_r
 4:         N_v^(f,r) <- N_v^(f,r) \ {alpha} ;  N_v^(Fbar,r) <- N_v^(Fbar,r) + {alpha}
 5:         M_v^(f,r) <- M_v^(f,r) - 1 ;        M_v^(Fbar,r) <- M_v^(Fbar,r) + 1
 6:     end for
 7:     V_k^m <- {} for all (k, m)              // new variable groupings
 8:     for alpha in F_Fbar do
 9:         v <- alpha_r ;  k <- K_v ;  m <- M_v^(Fbar,r)
10:         V_k <- V_k \ {v} ;  V_k^m <- V_k^m + {v}
11:     end for
12:     for (k, m) : V_k^m != {} do             // update lifted graph
13:         K' <- k
14:         if m != last_m(k) then              // new super-node
15:             K <- K + 1 ;  K' <- K
16:             M~_K'^(f',r') <- M~_k^(f',r')  for all (f', r') in N~_k
17:         end if
18:         V_K' <- V_k^m ;  K_v <- K'  for all v in V_K'
19:         M~_K'^(Fbar,r) <- m ;  M~_K'^(f,r) <- M~_K'^(f,r) - m
20:     end for
21: end for
Note: updating M~_k'^(f',r') (for any k', f', r') changes N~_k', N~^(f',r'),
and Q. For compactness, these changes are not shown.
```

The first loop updates $N^{(f,r)}_v$ and $N^{(\bar F,r)}_v$ and their counts. The second loop groups variables by their signature $[K_v, M^{(\bar F,r)}_v]$, creating the sets $V^m_k = \{v \mid K_v = k,\ m = M^{(\bar F,r)}_v\}$, where $K_v$ is the partition holding variable $v$ ($v \in V_k$ with $k = K_v$). This produces the same grouping as would be obtained by grouping on the full signature $[M^1_v \dots M^{\bar F - 1}_v, M^{\bar F}_v]$, because only the $M^{(f,r)}_v$ and $M^{(\bar F,r)}_v$ counts change during the call. Note that the change to $M^{(f,r)}_v$ does not affect our grouping, since for all $v \in V^m_k$, $M^{(f,r)}_v$ is equal to its value before the call minus $M^{(\bar F,r)}_v$ (the updates to $\tilde M^{(f,r)}_{K'}$ on line 19 reflect this fact).
Lastly, the third loop adds the $V^m_k$ sets to $P_V$ and updates the quantities associated with the lifted graph (the $\tilde M$ and $\tilde N$ terms). $V_k$ is overwritten with the $V^m_k$ where $m$ is the last $(k, m)$ encountered in the loop. In all other cases, $k$ splits into a new super-node and its neighbors are copied (line 16).

**Complexity analysis.** Each set is implemented as a hash table supporting O(1) addition and deletion ($N^{(f',r')}_v$ stores pointers to factor scope vectors). Hence, each $\alpha \in F_{\bar F}$ contributes $O(|\alpha|)$ time to the first two loops of refine_PV. Throughout Algorithms 1 and 2, an $\alpha$ is only ever moved to an $F_{\bar F}$ that is at most half the size of its current set. Thus, each $\alpha$ participates in at most $\log_2 |F|$ sets throughout either algorithm. Matching the result of (Berkholz, Bonsma, and Grohe 2013), the total time of these loops is $O(R\,|F| \log |F|)$, where $R = \max_{\alpha \in F} |\alpha|$. Throughout Algorithms 1 and 2, the total time spent in the third loop of refine_PV is proportional to the size of the lifted graph. This is because the loop adds nodes and edges to the lifted graph in O(1) time, and each addition is accompanied by at most one deletion (if $\tilde M^{(f,r)}_{K'} = 0$ in line 19). Since the size of the lifted graph is at most the size of the ground graph ($R\,|F|$), the total time is dominated by the first two loops.

## References

Berkholz, C.; Bonsma, P.; and Grohe, M. 2013. Tight lower and upper bounds for the complexity of canonical colour refinement. In European Symposium on Algorithms, 145-156. Springer.

Broeck, G. V. d., and Niepert, M. 2014. Lifted probabilistic inference for asymmetric graphical models. arXiv preprint arXiv:1412.0315.

Bui, H. B.; Huynh, T. N.; and de Salvo Braz, R. 2012. Exact lifted inference with distinct soft evidence on every object. In AAAI.

Bui, H. H.; Huynh, T. N.; and Riedel, S. 2012. Automorphism groups of graphical models and lifted variational inference. arXiv preprint arXiv:1207.4814.

Bui, H. H.; Huynh, T. N.; and Sontag, D. 2014. Lifted tree-reweighted variational inference. arXiv preprint arXiv:1406.4200.

Globerson, A., and Jaakkola, T. S. 2008. Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations. In Advances in Neural Information Processing Systems, 553-560.

Habeeb, H.; Anand, A.; Singla, P.; et al. 2017. Coarse-to-fine lifted MAP inference in computer vision. arXiv preprint arXiv:1707.07165.

Kersting, K.; Ahmadi, B.; and Natarajan, S. 2009. Counting belief propagation. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, 277-284. AUAI Press.

Kersting, K.; El Massaoudi, Y.; Hadiji, F.; and Ahmadi, B. 2010. Informed lifting for message-passing. In AAAI.

Mladenov, M., and Kersting, K. 2015. Equitable partitions of concave free energies. In UAI, 602-611.

Mladenov, M.; Ahmadi, B.; and Kersting, K. 2012. Lifted linear programming. In AISTATS, 788-797.

Mladenov, M.; Globerson, A.; and Kersting, K. 2014. Lifted message passing as reparametrization of graphical models. In UAI, 603-612.

Ping, W.; Liu, Q.; and Ihler, A. T. 2015. Decomposition bounds for marginal MAP. In Advances in Neural Information Processing Systems, 3267-3275.

Poole, D. 2003. First-order probabilistic inference. In IJCAI, volume 3, 985-991.

Richardson, M., and Domingos, P. 2006. Markov logic networks. Machine Learning 62(1):107-136.

Singla, P., and Domingos, P. M. 2008. Lifted first-order belief propagation. In AAAI, volume 8, 1094-1099.

Singla, P.; Nath, A.; and Domingos, P. M. 2014. Approximate lifting techniques for belief propagation. In AAAI, 2497-2504.
Van den Broeck, G., and Darwiche, A. 2013. On the complexity and approximation of binary evidence in lifted inference. In Advances in Neural Information Processing Systems, 2868-2876.

Venugopal, D., and Gogate, V. 2014. Evidence-based clustering for scalable inference in Markov logic. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 258-273. Springer.

Wainwright, M. J., and Jordan, M. I. 2008. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning 1(1-2):1-305.

Wainwright, M. J.; Jaakkola, T. S.; and Willsky, A. S. 2005. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory 51(7):2313-2335.