# Lifted Generalized Dual Decomposition

Nicholas Gallo, University of California Irvine, Irvine, CA 92637-3435, ngallo1@uci.edu
Alexander Ihler, University of California Irvine, Irvine, CA 92637-3435, ihler@ics.uci.edu

## Abstract

Many real-world problems, such as Markov Logic Networks (MLNs) with evidence, can be represented as a highly symmetric graphical model perturbed by additional potentials. In these models, variational inference approaches that exploit exact model symmetries are often forced to ground the entire problem, while methods that exploit approximate symmetries (such as by constructing an over-symmetric approximate model) offer no guarantees on solution quality. In this paper, we present a method based on a lifted variant of the generalized dual decomposition (GenDD) for marginal MAP inference, which provides a principled way to exploit symmetric sub-structures in a graphical model. We develop a coarse-to-fine inference procedure that provides any-time upper bounds on the objective. The upper bound property of GenDD gives a principled way to guide the refinement process, providing good any-time performance and eventually arriving at the ground optimal solution.

## Introduction

A central task in many application domains (Wainwright and Jordan 2008) is computing likelihoods and marginal probabilities over distributions defined by a graphical model. These tasks are, in general, intractable, which has motivated the development of many approximate inference techniques. Of significant theoretical and practical interest are variational techniques, which formulate the inference problem as a constrained optimization problem. Within this class, methods that provide upper bounds on the partition function and give rise to convex optimization problems are particularly attractive, since convergence can be guaranteed and the quality of two bounds can be easily compared. This class contains the tree-reweighted bounds (Wainwright, Jaakkola, and Willsky 2005) and primal variants such as GenDD (Ping, Liu, and Ihler 2015), which we study here.

Recently, there has been growing interest in the development of lifted inference techniques, both exact (Bui, Huynh, and de Salvo Braz 2012; Poole 2003) and approximate (Bui, Huynh, and Riedel 2012; Mladenov, Ahmadi, and Kersting 2012; Mladenov and Kersting 2015; Bui, Huynh, and Sontag 2014; Mladenov, Globerson, and Kersting 2014), which exploit model symmetries. Most of these works take a well-established variational approximation and identify exact problem symmetries either algorithmically (e.g., the redundant message computations of loopy belief propagation (LBP) (Singla and Domingos 2008; Kersting, Ahmadi, and Natarajan 2009)) or via graph-theoretic notions of symmetry (Bui, Huynh, and Riedel 2012; Mladenov, Ahmadi, and Kersting 2012; Mladenov and Kersting 2015; Bui, Huynh, and Sontag 2014; Mladenov, Globerson, and Kersting 2014). These methods, however, often have significant trouble with models possessing highly symmetric substructures but few exact symmetries. Some works (Venugopal and Gogate 2014; Van den Broeck and Darwiche 2013) address this by generating a symmetric approximation to an MLN (Richardson and Domingos 2006) with evidence, but offer no guarantees on the quality of the approximation.
(Broeck and Niepert 2014) corrects this bias by using an over-symmetric approximation to construct a proposal distribution for a Metropolis–Hastings chain, which guarantees convergence to the true marginals but gives no guarantees on mixing time. Other works approximate lifted BP by grouping messages with similar values (Kersting et al. 2010) or messages satisfying a desired structural property (Singla, Nath, and Domingos 2014). (Singla, Nath, and Domingos 2014) analyzes the errors the approximation induces on the true BP messages but, like BP, offers no guarantees on solution quality or convergence.

Similar in spirit to our work is (Habeeb et al. 2017), which sequentially solves a set of relaxed MAP inference problems for computer vision tasks. Their method operates in a coarse-to-fine manner, solving at each level a MAP problem in which groups of pixels are restricted to take the same value. This pixel grouping is then relaxed and the finer MAP problem is initialized with the solution to the coarser problem, guaranteeing monotonic improvement of the solution.

Our paper provides a framework for exploiting approximate model symmetries, based on a lifted variant of GenDD (Ping, Liu, and Ihler 2015). In contrast to works that create an over-symmetric model approximation, our method imposes a set of over-symmetric restrictions on the variational cost-shifting updates (messages) associated with the GenDD problem. Guided by a measure of solution quality, these constraints are gradually relaxed, producing a sequence of problems of increasing accuracy but increasing cost. Our experimental results show the superiority of the objective-based refinement criterion, and good anytime performance compared to methods that exploit exact symmetries.

## Background

A Markov random field (MRF) over $n$ discrete random variables $X = [X_1 \dots X_n]$ taking values $x = [x_1 \dots x_n] \in \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$ has probability density function

$$p(X = x; \theta) = \exp\Big( \sum_{\alpha \in F} \theta_\alpha(x_\alpha) - \Phi(\theta) \Big),$$

where $F$ represents the set of cliques of the distribution, each $\alpha \in F$ being a subset of the variables associated with a potential table $\theta_\alpha$, and $\Phi(\theta)$ is the log partition function, which normalizes the distribution. Furthermore, denote the set of variable indices as $V = \{1 \dots n\}$.

### Generalized dual decomposition

Motivated by the intractability of computing $\Phi(\theta)$, (Ping, Liu, and Ihler 2015) develops efficient upper bounds on $\Phi(\theta)$ via a decomposition based on Hölder's inequality. Their bound generalizes the dual decomposition bound (Globerson and Jaakkola 2008) for the MAP problem, maintaining two key properties: (1) it decomposes into a sum of terms defined over cliques of the graphical model, and (2) a family of bounds can be indexed by a set of variational parameters, where finding the parameter setting yielding the tightest bound can be formulated as a convex optimization problem. We review their main results.

For any non-negative function $f(x)$ and $w \in \mathbb{R}_{>0}$, the power sum operator is defined as

$$\sum_{x}^{w} f(x) \overset{\text{def}}{=} \Big( \sum_x f(x)^{1/w} \Big)^{w}.$$

The power sum reduces to the standard sum when $w = 1$ and approaches $\max_x f(x)$ as $w \to 0^+$. We also define the vector power sum

$$\sum_{\mathbf{x}}^{\mathbf{w}} F(\mathbf{x}) = \sum_{x_k}^{w_k} \cdots \sum_{x_1}^{w_1} F(\mathbf{x}),$$

where $\mathbf{w}$ is a vector of $k$ weights, $\mathbf{x}$ is a vector of $k$ random variables, and $F$ is a non-negative function of $k$ inputs.

The set of variational parameters $\delta = \{\delta^r_\alpha \mid \alpha \in F,\ r \in \mathbb{N}^{|\alpha|}_+\}$ and $w = \{w^r_\alpha \mid \alpha \in F,\ r \in \mathbb{N}^{|\alpha|}_+\}$, where $\mathbb{N}^i_+ = \{1 \dots i\}$ for any positive integer $i$, reparameterize the distribution to provide a tighter bound.
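As a concrete reference point before presenting the bound, here is a minimal numerically stable Python sketch of the power sum (our illustration, not code from the paper), checking both limiting cases:

```python
import numpy as np

def power_sum(f, w):
    """Power sum (sum_x f(x)**(1/w))**w of a non-negative vector f,
    evaluated in log space for numerical stability."""
    log_f = np.log(np.maximum(f, 1e-300))          # guard against log(0)
    z = log_f / w
    m = z.max()
    return float(np.exp(w * (m + np.log(np.exp(z - m).sum()))))

f = np.array([1.0, 2.0, 3.0])
assert np.isclose(power_sum(f, 1.0), f.sum())      # w = 1: ordinary sum
assert np.isclose(power_sum(f, 1e-6), f.max())     # w -> 0+: approaches max
```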
The bound of (Ping, Liu, and Ihler 2015) can be written as

$$\sum_{v \in V} u_v(\delta, w) + \sum_{\alpha \in F} c_\alpha(\delta, w) \overset{\text{def}}{=} L(\delta, w), \tag{1}$$

where, letting $w_\alpha = [w^1_\alpha \dots w^{|\alpha|}_\alpha]$, the clique terms are

$$c_\alpha(\delta, w) = \log \sum_{x_\alpha}^{w_\alpha} \exp\Big( \theta_\alpha(x_\alpha) - \sum_{r=1}^{|\alpha|} \delta^r_\alpha(x_{\alpha_r}) \Big) \tag{2}$$

and the unary terms are

$$u_v(\delta, w) = \log \sum_{x_v}^{w^0_v} \exp(\delta^0_v(x_v)), \tag{3}$$

with $\delta^0_v(x_v) = \sum_{(\alpha,r) \in N_v} \delta^r_\alpha(x_v)$ and $w^\Delta_v = 1 - \sum_{(\alpha,r) \in N_v} w^r_\alpha$, where $N_v = \{(\alpha, r) \mid \alpha_r = v\}$ are the neighbors of $v$. Furthermore, we define $w^0_v = \max(0, w^\Delta_v)$ and require $w^r_\alpha \geq 0$ for all $(\alpha, r)$.[^1] We also require that the local elimination order of each clique power sum term (2) be consistent with a global elimination order $o$, meaning that for all $i < j$, $o(\alpha_i) < o(\alpha_j)$. If this is violated for an $\alpha$ in the input, that $\alpha$ and its associated potential are permuted to obey this order.

[^1]: In (Ping, Liu, and Ihler 2015), $w^0_v$ is a free variable with restrictions $w^0_v \geq 0$ and $w^0_v + \sum_{(\alpha,r) \in N_v} w^r_\alpha = 1$. In our formulation, this sum is always larger than or equal to 1 (providing a valid bound), with equality holding at the optimum.

*Figure 1: Position-explicit factor graph associated with $\alpha_1 = (x_1, x_2)$, $\alpha_2 = (x_2, x_3)$. Factor cells $(\alpha_i, r)$ are associated with parameters $(\delta^r_{\alpha_i}, w^r_{\alpha_i})$, and unary terms are sums of neighboring cost-shifting variables: $\delta^0_1 = \delta^1_{\alpha_1}$, $\delta^0_2 = \delta^2_{\alpha_1} + \delta^1_{\alpha_2}$, $\delta^0_3 = \delta^2_{\alpha_2}$ (similarly for the weights).*
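The clique and unary terms (2)–(3) are nested power sums in log space. Below is a minimal sketch of evaluating them, assuming dense numpy tables, scipy available, and the first-position-first elimination convention of the vector power sum above; the helper names are ours, not the authors':

```python
import numpy as np
from scipy.special import logsumexp

def log_power_sum(log_f, w, axis=0):
    """log of the power sum of exp(log_f) along one axis, with weight w > 0."""
    return w * logsumexp(log_f / w, axis=axis)

def clique_term(theta, deltas, weights):
    """c_alpha of Eq. (2): theta is an |alpha|-dim log-potential table;
    deltas[r], weights[r] are the cost-shift vector and weight of position r."""
    log_f = theta.astype(float)
    for r, d in enumerate(deltas):              # theta - sum_r delta_r
        shape = [1] * log_f.ndim
        shape[r] = -1
        log_f = log_f - np.reshape(d, shape)
    for w in weights:                           # sequential power sums; each
        log_f = log_power_sum(log_f, w, axis=0) # eliminated axis drops out
    return float(log_f)

def unary_term(delta0, w0):
    """u_v of Eq. (3): log power sum of exp(delta0) with weight w0 >= 0."""
    return float(log_power_sum(np.asarray(delta0, float), max(w0, 1e-8)))
```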
### Ground factor graphs

In order to facilitate development of our lifted inference procedure, we modify the standard factor graph depiction of graphical models to make each variable's position in its participating factors explicit. To this end, we draw a circular node associated with each variable $v$ and a rectangle partitioned into $|\alpha|$ cells associated with each factor $\alpha$. Position cell $(\alpha, r)$ is associated with terms $(\delta^r_\alpha, w^r_\alpha)$ (labeled with just $\delta^r_\alpha$ for compactness) and has an edge to its variable neighbor $v = \alpha_r$, which is associated with terms $\delta^0_v, w^\Delta_v, u_v$ (labeled with just $\delta^0_v$). A simple example is illustrated in Figure 1.[^2]

[^2]: (Mladenov and Kersting 2015) used a similar representation where each position cell is its own node, connected to its factor node. For compactness, we prefer the notation presented here.

## Exploiting symmetries

Lifted inference techniques identify a set of symmetries in the ground variational problem that can be exploited computationally. Exact lifted variational inference (Singla and Domingos 2008; Mladenov, Globerson, and Kersting 2014; Mladenov and Kersting 2015; Mladenov, Ahmadi, and Kersting 2012) identifies a stable partition (Berkholz, Bonsma, and Grohe 2013) of the graph as sufficient to characterize symmetries in the solution to the ground variational problem. In this paper, we instead impose a set of over-symmetric constraints on the ground variational parameters $(\delta, w)$, inducing symmetries in the objective (1) that will, in general, be coarser than those implied by a stable partition, yielding looser but computationally cheaper bounds.

### Parameter and objective symmetries

Consider a disjoint partition $P_F = \{F_1 \dots F_{\bar F}\}$ of the ground factors $F$ ($F_i \cap F_j = \emptyset$ for $i \neq j$, $\bigcup_f F_f = F$) where all factors in the same partition are required to have the same potential function. That is, for each $f$ there exists a representative factor $\tilde\theta_f$ (with associated scope size $\tilde r_f$ and domains $\tilde{\mathcal{X}}^r_f$ for $r \in \mathbb{N}^{\tilde r_f}_+$) such that for all $\alpha \in F_f$, $\theta_\alpha = \tilde\theta_f$ (and $|\alpha| = \tilde r_f$ and $\mathcal{X}_{\alpha_r} = \tilde{\mathcal{X}}^r_f$ for $r \in \mathbb{N}^{\tilde r_f}_+$). In this paper, we consider over-symmetric constraints on the variational parameters that force terms associated with factors in the same partition to be identical. That is, we consider parameters in the set

$$S(P_F) = \Big\{ (\delta, w) \;\Big|\; \exists (\tilde\delta, \tilde w)\ \forall F_f \in P_F,\ \forall\alpha \in F_f,\ \forall r \in \mathbb{N}^{|\alpha|}_+ :\ \delta^r_\alpha(\tilde x^r_f) = \tilde\delta^r_f(\tilde x^r_f),\ w^r_\alpha = \tilde w^r_f \Big\},$$

where $\tilde\delta = \{\tilde\delta^r_f \mid F_f \in P_F,\ r \in \mathbb{N}^{\tilde r_f}_+\}$ and $\tilde w = \{\tilde w^r_f \mid F_f \in P_F,\ r \in \mathbb{N}^{\tilde r_f}_+\}$ are restricted sets of parameters.

For any $(\delta, w) \in S(P_F)$, we identify symmetries in the clique terms (2) as $c_\alpha(\delta, w) = \tilde c_f(\tilde\delta, \tilde w)$ for all $\alpha \in F_f$, where

$$\tilde c_f(\tilde\delta, \tilde w) \overset{\text{def}}{=} \log \sum_{\tilde x}^{\tilde w_f} \exp\Big( \tilde\theta_f(\tilde x) - \sum_{r=1}^{\tilde r_f} \tilde\delta^r_f(\tilde x^r_f) \Big) \tag{4}$$

and $\tilde w_f = [\tilde w^1_f \dots \tilde w^{\tilde r_f}_f]$ and $\tilde x = [\tilde x^1_f \dots \tilde x^{\tilde r_f}_f]$.

To recognize the symmetries induced in the aggregated cost-shifting terms $\delta^0_v, w^0_v$ (and hence in $u_v$ (3)), first let $N^{(f,r)}_v = \{\alpha \mid \alpha \in F_f,\ (\alpha, r) \in N_v\}$ represent the neighbors of $v$ that contribute $\delta^r_\alpha(\tilde x^r_f) = \tilde\delta^r_f(\tilde x^r_f)$ to the sum. There are $M^{(f,r)}_v = |N^{(f,r)}_v|$ such neighbors. Furthermore, define a disjoint partition $P_V = \{V_1 \dots V_K\}$ of $V$ where all variables in the same partition have the same domain ($\mathcal{X}_v = \tilde{\mathcal{X}}_k$ for all $v \in V_k$) and the same neighbor counts given $P_F$. That is, there exist representative counts $\tilde M^{(f,r)}_k$ such that $M^{(f,r)}_v = \tilde M^{(f,r)}_k$ for all $v \in V_k$ and all $(f, r)$. Now, for all $v \in V_k$ we have

$$\delta^0_v(\tilde x_k) = \tilde\delta^0_k(\tilde x_k) \overset{\text{def}}{=} \sum_{(f,r) \in \tilde N_k} \tilde M^{(f,r)}_k \tilde\delta^r_f(\tilde x_k) \tag{5}$$

for all values $\tilde x_k$, where $\tilde N_k = \{(f, r) \mid \tilde M^{(f,r)}_k > 0\}$ (and similarly, $w^\Delta_v = \tilde w^\Delta_k \overset{\text{def}}{=} 1 - \sum_{(f,r) \in \tilde N_k} \tilde M^{(f,r)}_k \tilde w^r_f$). These symmetries imply that $u_v(\delta, w) = \tilde u_k(\tilde\delta, \tilde w)$ for all $v \in V_k$.

Our objective can now be evaluated over the smaller set of representative parameters $(\tilde\delta, \tilde w)$, with symmetries fully specified by $P = (P_F, P_V)$. We say that such a $P$ is valid, and we have $L(\delta, w) = \tilde L(\tilde\delta, \tilde w)$, where

$$\tilde L(\tilde\delta, \tilde w) \overset{\text{def}}{=} \sum_{F_f \in P_F} |F_f|\, \tilde c_f(\tilde\delta, \tilde w) + \sum_{V_k \in P_V} |V_k|\, \tilde u_k(\tilde\delta, \tilde w). \tag{6}$$

### Lifted factor graphs

Associated with any valid partition $P$ is a lifted graph, which compactly illustrates the symmetries in the variational parameters and problem terms (1). A lifted factor graph has a super-node associated with each $V_k$ and a super-factor with $\tilde r_f$ position cells associated with each $F_f$. Position cell $(f, r)$ is associated with $\tilde\delta^r_f, \tilde w^r_f$ (labeled with just $\tilde\delta^r_f$) and has a super-edge labeled $\tilde M^{(f,r)}_k$ to super-node $k$ if $\tilde M^{(f,r)}_k > 0$. The term $\tilde N_k$ defined after (5) can be seen as the neighborhood of $k$ in the lifted graph. We similarly define the neighborhood of $(f, r)$ as $\tilde N^{(f,r)} = \{k \mid \tilde M^{(f,r)}_k > 0\}$.

**Example 1.** Consider a model with factor graph depicted in Figure 2a (with elimination order $o = (R_1, R_2, T_1, T_2)$) and clique symmetries $\theta_{\alpha_2} = \theta_{\alpha_3} = \theta_{\alpha_4} = \theta_{\alpha_5} = \theta$. Associated with the graph is the partition $P_F = \{F_g, F_y\}$ where $F_g = \{\alpha_1\}$ (green), $F_y = \{\alpha_2, \alpha_3, \alpha_4, \alpha_5\}$ (yellow), and $P_V = \{V_p, V_r, V_b\}$ where $V_p = \{R_1\}$ (purple), $V_r = \{R_2\}$ (red), and $V_b = \{T_1, T_2\}$ (blue).[^3] The top half of the middle panel indicates the over-symmetric parameter constraints specified by $P_F$. The bottom half of that panel indicates the induced symmetries in the unary terms. For example, by examining the neighborhood of $R_1$ in the factor graph, we see that $\delta^0_{R_1} = \delta^1_{\alpha_2} + \delta^1_{\alpha_3} + \delta^1_{\alpha_1}$. Via the parameter symmetries $\delta^1_{\alpha_2} = \delta^1_{\alpha_3} = \tilde\delta^1_y$ and $\delta^1_{\alpha_1} = \tilde\delta^1_g$, we see that this is equivalent to $\delta^0_{R_1} = 2\tilde\delta^1_y + \tilde\delta^1_g$.

[^3]: The main text used integers to index factor and variable partitions; here we use letters, representing figure colors, for clarity.
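Evaluating (6) touches only one representative term per partition, scaled by the partition size. A sketch of this bookkeeping, reusing `clique_term` and `unary_term` from the earlier sketch; the tuple-based data layout is our own assumption, not the authors':

```python
def lifted_objective(factor_parts, var_parts):
    """Evaluate L-tilde of Eq. (6).
    factor_parts: list of (size |F_f|, theta_f, deltas_f, weights_f)
    var_parts:    list of (size |V_k|, [(M_k^(f,r), delta_f^r, w_f^r), ...])"""
    bound = 0.0
    for size, theta, deltas, weights in factor_parts:
        bound += size * clique_term(theta, deltas, weights)
    for size, lifted_edges in var_parts:
        delta0 = sum(M * d for M, d, _ in lifted_edges)        # Eq. (5)
        w_delta = 1.0 - sum(M * w for M, _, w in lifted_edges)
        bound += size * unary_term(delta0, max(0.0, w_delta))  # w0 = max(0, wD)
    return bound
```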
### Gradient symmetries

Just as the over-symmetric parameter constraints induce symmetries in the ground objective function, they also induce (a different set of) symmetries in the gradient of the ground objective. These gradient symmetries will be used to derive the gradient of the lifted problem, to identify exact symmetries in the ground problem, and to guide our coarse-to-fine inference procedure.

First, notice that the parameter symmetries imply not only symmetries in the values of the clique (4) and unary (5) terms, but also in their gradients. That is, for any $(\delta, w) \in S(P_F)$, we have

$$\frac{\partial c_\alpha}{\partial \delta^r_\alpha(\tilde x^r_f)} = \frac{\partial \tilde c_f}{\partial \tilde\delta^r_f(\tilde x^r_f)} \quad\text{and}\quad \frac{\partial u_v}{\partial \delta^0_v(\tilde x_k)} = \frac{\partial \tilde u_k}{\partial \tilde\delta^0_k(\tilde x_k)}$$

for any $\alpha \in F_f$, $v \in V_k$, and for all values of $\tilde x^r_f$ and $\tilde x_k$.

Now, consider the set $E^{(f,r)}_k = \bigcup_{v \in V_k} N^{(f,r)}_v$ of ground factors $\alpha \in F_f$ whose $r$-th position cell $(\alpha, r)$ has neighbor $v = \alpha_r \in V_k$. For any $\alpha \in E^{(f,r)}_k$, the parameter $\delta^r_\alpha$ appears only in clique term $c_\alpha$ in position $r$ (multiplied by $-1$) and in unary term $u_v$ (multiplied by $+1$). We can therefore identify symmetries in the objective gradient as

$$\frac{\partial L}{\partial \delta^r_\alpha(\tilde x_k)} = \frac{\partial u_v}{\partial \delta^0_v(\tilde x_k)} + \frac{\partial c_\alpha}{\partial \delta^r_\alpha(\tilde x_k)} = \frac{\partial \tilde u_k}{\partial \tilde\delta^0_k(\tilde x_k)} + \frac{\partial \tilde c_f}{\partial \tilde\delta^r_f(\tilde x_k)} \overset{\text{def}}{=} g^{(f,r)}_{k;\delta}(\tilde x_k) \tag{7}$$

for all values of $\tilde x_k$ (and similarly for the $w$ gradients).

*Figure 2: (a) A valid, non-stable partition $P$ imposes over-symmetric constraints on the variational parameters of the ground problem, yielding a looser bound. (b) A stable partition $P$, which characterizes exact symmetries in the variational parameters and yields an exact solution to the ground problem. More details in Examples 1 and 2 of the text.*

### Lifted gradients and ground optimality

The gradients of the lifted objective (6) can be written using the gradient symmetries of the ground problem. Specifying the lifted gradients in this way allows us to identify partitions whose symmetric parameter constraints specify an exact solution to the ground problem. Since, via (7), the ground gradients $\partial L / \partial \delta^r_\alpha(\tilde x^r_f)$ depend on $\alpha \in F_f$ only through the super-node $k$ containing $\alpha_r$ (for all values of $\tilde x^r_f$), we can write

$$\frac{\partial \tilde L}{\partial \tilde\delta^r_f(\tilde x^r_f)} = \sum_{\alpha \in F_f} \frac{\partial L}{\partial \delta^r_\alpha(\tilde x^r_f)} = \sum_{k \in \tilde N^{(f,r)}} |E^{(f,r)}_k|\; g^{(f,r)}_{k;\delta}(\tilde x^r_f) \tag{8}$$

where $|E^{(f,r)}_k| = \tilde M^{(f,r)}_k |V_k|$ (and similarly for the $w$ gradients). In the special case where $(f, r)$ has one lifted neighbor $\tilde N^{(f,r)} = \{k\}$, we have $F_f = E^{(f,r)}_k$ and (8) simplifies to $\partial \tilde L / \partial \tilde\delta^r_f(\tilde x^r_f) = |F_f|\, \partial L / \partial \delta^r_\alpha(\tilde x^r_f)$, meaning that $\partial L / \partial \delta^r_\alpha(\tilde x^r_f) = 0$ when $\partial \tilde L / \partial \tilde\delta^r_f(\tilde x^r_f) = 0$. Now, if all of the gradients of the lifted problem are zero, and its lifted graph has $|\tilde N^{(f,r)}| = 1$ for all $(f, r)$, then all of the ground gradients are also zero. This means that a solution of the lifted problem provides a solution to the ground problem. A partition $P$ whose lifted graph satisfies the property that $|\tilde N^{(f,r)}| = 1$ for all $(f, r)$ is called stable.

**Example 2.** If we replace $P_F$ from Example 1 with $P_F = \{F_g, F_y, F_o\}$ where $F_g = \{\alpha_1\}$ (green), $F_y = \{\alpha_2, \alpha_3\}$ (yellow), and $F_o = \{\alpha_4, \alpha_5\}$ (orange), then the partition is stable and $S(P_F)$ contains a solution to the ground variational problem. This is illustrated in Figure 2b.

### Optimizing the lifted objective

In this paper, we use a black-box procedure to perform global optimization utilizing the gradient of $\tilde L$. To ensure positivity of the weights, we define the parameters $\tilde\eta = \{\tilde\eta^r_f \mid F_f \in P_F,\ r \in \mathbb{N}^{\tilde r_f}_+\}$, substituting $\tilde w^r_f$ with $\epsilon + \exp(\tilde\eta^r_f)$ (with $\partial \tilde L / \partial \tilde\eta^r_f = (\partial \tilde L / \partial \tilde w^r_f) \exp(\tilde\eta^r_f)$). To ensure that the function is differentiable, we set $\tilde w^0_k = a(\tilde w^\Delta_k)$ where $a(x) = \epsilon + t \log(1 + \exp(x/t))$.[^4] This enables unconstrained optimization over the $\tilde\eta$ and $\tilde\delta$ parameters. The gradients are straightforward to derive from the gradients in (Ping, Liu, and Ihler 2015) and are omitted for space.

[^4]: Any $t > 0$, $\epsilon > 0$ bound the objective, since this generates larger unary weights: $a(w^\Delta_v) > \epsilon + \max(0, w^\Delta_v) > \max(0, w^\Delta_v)$.
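A small sketch of this reparameterization (our illustration; `EPS` and `T` use the values reported in the experiments section):

```python
import numpy as np

EPS, T = 1e-3, 1e-2     # epsilon and t as set in the paper's experiments

def w_from_eta(eta):
    """w = eps + exp(eta): keeps factor weights strictly positive while eta
    ranges over all reals, enabling unconstrained optimization."""
    return EPS + np.exp(eta)

def a(x, eps=EPS, t=T):
    """a(x) = eps + t*log(1 + exp(x/t)): a differentiable strict upper bound
    on max(0, x), written in the numerically stable softplus form."""
    z = np.asarray(x, dtype=float) / t
    return eps + t * (np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z))))

assert a(0.5) > max(0.0, 0.5)    # a(x) > max(0, x), so the bound stays valid
assert a(-0.5) > 0.0
```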
## Coarse-to-fine approximation

We now develop a coarse-to-fine procedure that interleaves relaxations of the over-symmetric parameter constraints $S(P_F)$ with optimization, eventually arriving at the coarsest stable partition $P$ and a solution to the ground problem. While optimization touches only the lifted graph, relaxation steps must touch a subset of the ground factors and variables. To control this cost, we adapt to our setting the key ideas of (Berkholz, Bonsma, and Grohe 2013), which presented an asymptotically optimal algorithm for generating the coarsest stable partition of a graph. This procedure is used as a pre-processing step in works on exact lifted variational inference such as (Mladenov, Globerson, and Kersting 2014).

```
Algorithm 1  Syntactic coarsest stable partition
 1: Set P_V to group variables with identical domains X_v
 2: Set F0_f, f = 1..Fbar0, to group factors with identical potentials theta
 3: Q <- {}
 4: for f = 1 to Fbar0 do
 5:     F_f <- F0_f ;  refine_PV(0)
 6: end for
 7: while Q != {} do
 8:     Choose any (f, r) in Q;  Q <- Q \ {(f, r)}
 9:     Choose a k in N~^(f,r) with |E_k^(f,r)| <= |F_f| / 2
10:     Fbar <- Fbar + 1
11:     F_Fbar <- E_k^(f,r) ;  F_f <- F_f \ F_Fbar
12:     refine_PV(f)
13: end while
```

### Syntactic coarsest stable partition

Throughout our description, we maintain a valid partition and its associated lifted graph. Recall from the subsection on lifted gradients (and Figure 2b) that a valid partition is stable if and only if each position cell in its lifted graph has exactly one neighbor. Algorithm 1 maintains the set $Q = \{(f, r) \mid |\tilde N^{(f,r)}| \geq 2\}$ of position cells that violate this property. The algorithm iteratively selects an $(f, r) \in Q$ and one of its neighbors $k \in \tilde N^{(f,r)}$. Ground factors in $F_f$ are then split into two groups (line 11), based on whether their $r$-th position neighbor is a variable in $V_k$ (recall from the section on gradient symmetries that this set is $E^{(f,r)}_k$) or not. In the appendix, we define the function refine_PV, which updates the partition $P_V$ to be consistent with the new neighbor counts $M^{f}_v$ and $M^{\bar F}_v$ for all $v$, updating the lifted graph and $Q$ as needed. The restriction of choosing $k$ such that $|E^{(f,r)}_k| \leq \frac{1}{2}|F_f|$ (line 9) is necessary to maintain the complexity results of (Berkholz, Bonsma, and Grohe 2013), as discussed in the appendix.
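For illustration, a minimal Python sketch of the split in line 11 of Algorithm 1, assuming a simple dict-based representation; all names here are our own:

```python
from collections import defaultdict

def split_factor_group(F_f, r, var_class):
    """Split factor group F_f on the class of each factor's r-th argument.
    var_class maps a variable id to its current partition (super-node) id.
    Returns (kept, moved) with |moved| <= |F_f|/2, the size restriction that
    preserves the O(|F| log |F|) guarantee."""
    by_class = defaultdict(list)
    for alpha in F_f:                    # alpha: tuple of variable ids (scope)
        by_class[var_class[alpha[r]]].append(alpha)
    if len(by_class) < 2:
        return list(F_f), []             # position cell (f, r) already stable
    k = min(by_class, key=lambda c: len(by_class[c]))   # a smallest class
    moved = by_class.pop(k)              # this is E_k^(f,r)
    kept = [a for group in by_class.values() for a in group]
    return kept, moved

# e.g. F_f = [(0, 1), (0, 2), (0, 3)] with var_class = {1: 'A', 2: 'A', 3: 'B'}
# splits off the singleton class: moved = [(0, 3)].
```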
### Coarse-to-fine algorithm

One simple way to construct a coarse-to-fine inference procedure is to interrupt the main loop of Algorithm 1 to perform lifted inference with relaxations and symmetries specified by the current partition $P$. This would yield any-time bounds and ensure that we never produce a model finer than the coarsest stable partition. In practice, however, superior results are obtained by choosing refinement operations based on their predicted decrease of the objective value. Algorithm 2 illustrates this idea, where $S^r_f$ (detailed in the next section) judges the quality of a split of $(f, r)$, cost($P$) is the estimated cost of one inference iteration (we count the number of operations used to compute the lifted score (6)), and $\beta$ controls how much refinement we allow between inference calls.

```
Algorithm 2  Coarse-to-fine inference
 1: Execute lines 1 to 6 of Algorithm 1              // initialize P
 2: Psi <- 0
 3: repeat
 4:     if cost(P) > Psi then
 5:         [delta~, w~] <- inference(P, delta~, w~)
 6:         Set S_f^r via (9) for all (f, r) in Q
 7:         Psi <- cost(P) * beta
 8:     end if
 9:     (f*, r*) <- argmin_{(f,r) in Q} S_f^r
10:     Set N~_1 subset of N~^(f*,r*) as the argmin of S_f*^r* via (9)
11:     Fbar <- Fbar + 1
12:     F_Fbar <- Union_{k in N~_1} E_k^(f*,r*) ;  F_f* <- F_f* \ F_Fbar
13:     refine_PV(f*)
14:     Set S_f^r via (9) for f in {f*, Fbar}, r = 1..r~_f
15: until P is stable
16: [delta~, w~] <- inference(P, delta~, w~)         // stable-P inference
```

### Objective-based scoring function

We first define a metric that measures the quality of a proposed parameter partition; we then show how to optimize this metric efficiently, using only information contained in the lifted graph. For the set of ground parameters associated with an $(f, r)$, we measure the error induced by the current parameter-tying restrictions as the distance of the ground gradients to their projected gradient. Letting $g^r_\alpha$ be the set of ground gradients associated with an $(\alpha, r)$, this metric is

$$d(F_f, r) = \sum_{\alpha \in F_f} \|g^r_\alpha - \gamma\|^2_2, \quad\text{with}\quad \gamma = \Big(\sum_{\alpha \in F_f} g^r_\alpha\Big) \Big/ |F_f|.$$

We would like to find a split of $F_f$ into $F_{f_1}$ and $F_{f_2}$ that minimizes $d(F_{f_1}, r) + d(F_{f_2}, r) - d(F_f, r)$, which is the change in the projected gradient distance. By symmetry, this split groups together factors with the same $g^r_\alpha$. Recalling from (7) that $g^r_\alpha = g^{(f,r)}_k$ for all $\alpha \in E^{(f,r)}_k$, we write this grouping as $F_{f_i} = \bigcup_{k \in \tilde N_i} E^{(f,r)}_k$, where $\tilde N_1 \subset \tilde N^{(f,r)}$ and $\tilde N_2 = \tilde N^{(f,r)} \setminus \tilde N_1$ (adopting the convention $|F_{f_1}| \leq |F_{f_2}|$).

We can now form a compact optimization problem to find the best split. For any $\tilde N \subseteq \tilde N^{(f,r)}$ we have $d\big(\bigcup_{k \in \tilde N} E^{(f,r)}_k, r\big) = \tilde d^r_f(\tilde N)$, where

$$\tilde d^r_f(\tilde N) = \sum_{k \in \tilde N} |E^{(f,r)}_k|\, \big\|g^{(f,r)}_k - \gamma\big\|^2_2, \quad\text{with}\quad \gamma = \Big(\sum_{k \in \tilde N} |E^{(f,r)}_k|\, g^{(f,r)}_k\Big) \Big/ \Big(\sum_{k \in \tilde N} |E^{(f,r)}_k|\Big).$$

Letting $\mathcal{P}(\tilde N^{(f,r)})$ be the power set (all subsets) of $\tilde N^{(f,r)}$, we solve

$$S^r_f = \min_{\tilde N_1 \in \mathcal{P}(\tilde N^{(f,r)})} \sum_{i=1}^{2} \tilde d^r_f(\tilde N_i) \;-\; \tilde d^r_f(\tilde N^{(f,r)}). \tag{9}$$

This is a weighted K-means (K = 2) problem with data $g^{(f,r)}_k$ and weights $|E^{(f,r)}_k|$. If we approximate this by using only one of the elements of $g^{(f,r)}_k$ (for example, only the gradient with respect to $\delta^{(f,r)}_k(1)$, as we use on the binary problems in this paper's experiments), the optimum can be found exactly by sorting the $g^{(f,r)}_k$ and looping over all splits.
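A sketch of this sorted scan for the scalar case (our illustration; it assumes at least two lifted neighbors, and relies on the standard fact that an optimal 1-D 2-clustering is contiguous in sorted order):

```python
import numpy as np

def best_split_1d(g, m):
    """Exact weighted 2-means in 1-D, scoring one position cell as in Eq. (9).
    g[i]: scalar gradient summary g_k for lifted neighbor k
    m[i]: weight |E_k^(f,r)|
    Returns (score S_f^r, indices forming one side of the split N_1)."""
    order = np.argsort(g)
    g = np.asarray(g, float)[order]
    m = np.asarray(m, float)[order]

    def wss(gs, ms):                     # weighted within-group sum of squares
        mu = (ms * gs).sum() / ms.sum()
        return float((ms * (gs - mu) ** 2).sum())

    score, cut = min((wss(g[:i], m[:i]) + wss(g[i:], m[i:]), i)
                     for i in range(1, len(g)))
    return score - wss(g, m), set(order[:cut])

# Identical g values give score 0 (nothing to gain); well-separated groups give
# a large negative score, flagging (f, r) as a good split candidate.
print(best_split_1d([0.0, 0.01, 5.0], [4, 4, 2]))
```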
*Figure 3: Test models with distinct soft evidence ($u$ and $l$ terms).*

(a) Complete Graph

| Weight | Formula |
|---|---|
| $b$ | $\neg(x = y) \wedge V(x) \wedge V(y)$ |
| $u_x$ | $V(x)$ |

(b) Binary Collective Classification

| Weight | Formula |
|---|---|
| $b$ | $\neg(x = y) \wedge L(x, y) \wedge (C(y) \Leftrightarrow C(x))$ |
| $u_x$ | $C(x)$ |
| $l_{xy}$ | $\neg(x = y) \wedge L(x, y)$ |

(c) Clique-Cycle

| Weight | Formula |
|---|---|
| $b$ | $\neg(x = y) \wedge Q_1(x) \wedge Q_2(y)$ |
| $b$ | $\neg(x = y) \wedge Q_2(x) \wedge Q_3(y)$ |
| $b$ | $\neg(x = y) \wedge Q_3(x) \wedge Q_1(y)$ |
| $u^i_x$ | $Q_i(x),\ i \in \{1, 2, 3\}$ |

*Figure 4: Comparison of optimization at the stable coloring (exact model symmetries) with our coarse-to-fine inference framework for three different splitting methods ("C2F - Objective", "C2F - Size", "C2F - Syntactic", vs. "Stable"), on each of the test models with domain size d = 160 for the logical variables. Panels (a) Complete, (b) Collective Classification, and (c) Clique Cycle plot upper bound vs. inference time (seconds, log scale). Vertical black dashed lines indicate coarse-to-fine transitions for objective-based splitting (the "C2F - Objective" curve); transitions for the other C2F curves occur at approximately the same positions.*

## Experiments

This section provides an empirical illustration of our lifted GenDD algorithm. We demonstrate the superiority of our objective-based approach over purely syntactic approaches for generating a coarse-to-fine approximation sequence. We also demonstrate excellent any-time performance on MLNs with distinct soft evidence on every node, compared to exact lifted variational approaches, which are forced to ground the entire problem.

### Datasets and methodology

**Datasets.** Figure 3 shows the models we use (similar models without evidence were used in (Mladenov and Kersting 2015; Bui, Huynh, and Sontag 2014) to evaluate performance on marginalization tasks). In all models, $b$ was set to 2.5 and the $u$ terms specify random evidence uniformly distributed on $[-\zeta, \zeta]$ where $\zeta = \log((1 - 10^{-2})/10^{-2})$, so that the $u$ terms act like probabilities between 0.01 and 0.99. For the collective classification problem, we also choose a random 10% of the $(x, y)$ values associated with links $L(x, y)$ and set $l_{xy}$ to a random value on $[0, 5.0]$. This can be interpreted as the strength of a link between two web pages.

**Parameter settings.** In our experiments, we set $\epsilon = 10^{-3}$, $t = 10^{-2}$ (in the definition of $a(x)$) and perform L-BFGS black-box optimization with a rank-20 Hessian correction. For the coarse-to-fine procedure, it is important to balance performing inference with model refinement. We found it worked well to perform a small number of inference iterations (30), followed by a small amount of model refinement (setting $\beta = 1.25$ in Algorithm 2). This worked better than trying to diagnose inference convergence to determine the transition point. Future work will look at more principled methods of balancing inference work with refinement.

**Timing methodology.** In our experiments, we measure only the cost of performing inference, ignoring the cost of identifying symmetries. This is common in experiments on approximate lifted inference in other works (Mladenov, Globerson, and Kersting 2014; Broeck and Niepert 2014; Venugopal and Gogate 2014), as well as in exact lifted variational inference approaches (Bui, Huynh, and Riedel 2012; Bui, Huynh, and Sontag 2014; Mladenov and Kersting 2015), which assume problem symmetries are available before inference. Works such as (Singla, Nath, and Domingos 2014; Kersting et al. 2010) note that identifying (approximate) symmetries can be a bottleneck. (Singla, Nath, and Domingos 2014) used a sparse hypercube-based representation to specify approximate symmetries in MLNs; adapting lifted GenDD to leverage a similar representation is an interesting direction for future work.

### Results

Figure 4 compares our coarse-to-fine procedure against performing inference on the stable coloring (the ground model in our setup) on each of our three test models. The blue curve ("C2F - Objective") performs objective-based splitting as per Algorithm 2, using the metric (9) to guide the splitting choice. We also show syntactic splitting ("C2F - Syntactic", orange) as per Algorithm 1, interleaving inference when the predicted inference cost exceeds a cost bound (then multiplying that bound by $\beta$ as in Algorithm 2). "C2F - Size" (red) splits the largest super-factor in $Q$ and chooses a random split that is as even as possible (adding random edges from $\tilde N^{(f,r)}$ to $\tilde N_2$, stopping when $\sum_{k \in \tilde N_2} |E^{(f,r)}_k| > \frac{1}{2}|F_f|$ and setting $\tilde N_1 = \tilde N^{(f,r)} \setminus \tilde N_2$ in line 10 of Algorithm 2), interleaving inference with refinement as in the other C2F methods. The stable (here, ground) inference process is shown in green.
We see that objective-based refinement significantly outperforms the others, with size-based refinement worse, but still better than syntactic splitting. Objective-based splitting provides orders-of-magnitude speed-up over the stable coloring for similar-quality results, and all methods provide better early performance than the stable coloring, which must touch the entire model to even provide an initial bound.

Figure 5 demonstrates the scalability of our approach. Plots (a)-(e) show the running time for our coarse-to-fine method ("C2F - Objective", blue) and the stable coloring method on instances of the complete graph, varying its size.

*Figure 5: Scalability comparison. Panels (a)-(e) show upper bound vs. inference time (seconds, log scale) on instances of the complete graph with evidence, varying the logical-variable domain size d in the MLN specification (panels (d) and (e) show d = 160 and d = 320). C2F exhibits an increasing advantage as the model grows larger.*

## Conclusions

We presented a framework for lifting GenDD for approximate inference by imposing over-symmetric constraints on the optimization parameters to induce symmetry in the variational bound. We developed a coarse-to-fine procedure that, guided by the objective, provides high-quality any-time approximations even in models with no exact symmetry. Our method provides orders-of-magnitude speed-up on our benchmarks, with increasing advantage for larger models.

## Acknowledgements

This work was sponsored in part by NSF grant IIS-1254071.

## Appendix

**Description of refine_PV.** After creating a new partition $F_{\bar F}$ (possibly by splitting it off from $F_f$), Algorithms 1 and 2 call refine_PV to update $P_V$ such that variables having identical neighbor counts $M_v = [M^1_v, \dots, M^{\bar F}_v]$ are grouped together (where $M^{f'}_v \overset{\text{def}}{=} [M^{(f',1)}_v \dots M^{(f',\tilde r_{f'})}_v]$ for all $f'$). We assume that at entry, the neighbor counts $M^{f'}_v$ are correct for all $v$ and $f' = 1 \dots \bar F - 1$, and that $P_V$ correctly groups variables by these counts, i.e., $M^{f'}_v = \tilde M^{f'}_k$ for all $v \in V_k$.

```
Algorithm 3  refine_PV
Input: f   (if f = 0, F_Fbar was deleted from F_f before the call)
 1: for r = 1 to r~_Fbar do
 2:     for alpha in F_Fbar do                  // update ground neighbor lists
 3:         v <- alpha_r
 4:         N_v^(f,r) <- N_v^(f,r) \ {alpha} ;  N_v^(Fbar,r) <- N_v^(Fbar,r) + {alpha}
 5:         M_v^(f,r) <- M_v^(f,r) - 1 ;        M_v^(Fbar,r) <- M_v^(Fbar,r) + 1
 6:     end for
 7:     V_k^m <- {} for all (k, m)              // new variable groupings
 8:     for alpha in F_Fbar do
 9:         v <- alpha_r ;  k <- K_v ;  m <- M_v^(Fbar,r)
10:         V_k <- V_k \ {v} ;  V_k^m <- V_k^m + {v}
11:     end for
12:     for (k, m) : V_k^m != {} do             // update lifted graph
13:         K' <- k
14:         if m != last_m(k) then              // new super-node
15:             K <- K + 1 ;  K' <- K
16:             M~_K'^(f',r') <- M~_k^(f',r')  for all (f', r') in N~_k
17:         end if
18:         V_K' <- V_k^m ;  K_v <- K'  for all v in V_K'
19:         M~_K'^(Fbar,r) <- m ;  M~_K'^(f,r) <- M~_K'^(f,r) - m
20:     end for
21: end for
Note: updating M~_k'^(f',r') (for any k', f', r') changes N~_k', N~^(f',r'),
and Q. For compactness, these changes are not shown.
```

The first loop updates $N^{(f,r)}_v$ and $N^{(\bar F,r)}_v$ and their counts. The second loop groups variables by their signature $[K_v, M^{(\bar F,r)}_v]$, creating the sets $V^m_k = \{v \mid K_v = k,\ m = M^{(\bar F,r)}_v\}$, where $K_v$ is the partition holding variable $v$ ($v \in V_k$ with $k = K_v$). This produces the same grouping as would be obtained by grouping on the full signature $[M^1_v \dots M^{\bar F - 1}_v, M^{\bar F}_v]$, because only the $M^{(f,r)}_v$ and $M^{(\bar F,r)}_v$ counts change during the call. Note that the change to $M^{(f,r)}_v$ does not affect our grouping, since for all $v \in V^m_k$, $M^{(f,r)}_v$ is equal to its value before the call minus $M^{(\bar F,r)}_v$ (the updates to $\tilde M^{(f,r)}_{K'}$ on line 19 reflect this fact).
Lastly, the third loop adds the $V^m_k$ sets to $P_V$ and updates the quantities associated with the lifted graph (the $\tilde M$ and $\tilde N$ terms). $V_k$ is overwritten with the $V^m_k$ where $m$ is the last $(k, m)$ encountered in the loop. In all other cases, $k$ splits into a new super-node and its neighbors are copied (line 16).

**Complexity analysis.** Each set is implemented as a hash table supporting O(1) addition and deletion ($N^{(f',r')}_v$ stores pointers to factor scope vectors). Hence, each $\alpha \in F_{\bar F}$ contributes $O(|\alpha|)$ time to the first two loops of refine_PV. Throughout Algorithms 1 and 2, an $\alpha$ is only ever moved to an $F_{\bar F}$ that is at most half the size of its current set. Thus, each $\alpha$ participates in at most $\log_2 |F|$ sets throughout either algorithm. Matching the result of (Berkholz, Bonsma, and Grohe 2013), the total time of these loops is $O(R\,|F| \log |F|)$, where $R = \max_{\alpha \in F} |\alpha|$. Throughout Algorithms 1 and 2, the total time spent in the third loop of refine_PV is proportional to the size of the lifted graph. This is because the loop adds nodes and edges to the lifted graph in O(1) time, and each addition is accompanied by at most one deletion (if $\tilde M^{(f,r)}_{K'} = 0$ in line 19). Since the size of the lifted graph is at most the size of the ground graph ($R\,|F|$), the total time is dominated by the first two loops.

## References

Berkholz, C.; Bonsma, P.; and Grohe, M. 2013. Tight lower and upper bounds for the complexity of canonical colour refinement. In European Symposium on Algorithms, 145-156. Springer.

Broeck, G. V. d., and Niepert, M. 2014. Lifted probabilistic inference for asymmetric graphical models. arXiv preprint arXiv:1412.0315.

Bui, H. B.; Huynh, T. N.; and de Salvo Braz, R. 2012. Exact lifted inference with distinct soft evidence on every object. In AAAI.

Bui, H. H.; Huynh, T. N.; and Riedel, S. 2012. Automorphism groups of graphical models and lifted variational inference. arXiv preprint arXiv:1207.4814.

Bui, H. H.; Huynh, T. N.; and Sontag, D. 2014. Lifted tree-reweighted variational inference. arXiv preprint arXiv:1406.4200.

Globerson, A., and Jaakkola, T. S. 2008. Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations. In Advances in Neural Information Processing Systems, 553-560.

Habeeb, H.; Anand, A.; Singla, P.; et al. 2017. Coarse-to-fine lifted MAP inference in computer vision. arXiv preprint arXiv:1707.07165.

Kersting, K.; Ahmadi, B.; and Natarajan, S. 2009. Counting belief propagation. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, 277-284. AUAI Press.

Kersting, K.; El Massaoudi, Y.; Hadiji, F.; and Ahmadi, B. 2010. Informed lifting for message-passing. In AAAI.

Mladenov, M., and Kersting, K. 2015. Equitable partitions of concave free energies. In UAI, 602-611.

Mladenov, M.; Ahmadi, B.; and Kersting, K. 2012. Lifted linear programming. In AISTATS, 788-797.

Mladenov, M.; Globerson, A.; and Kersting, K. 2014. Lifted message passing as reparametrization of graphical models. In UAI, 603-612.

Ping, W.; Liu, Q.; and Ihler, A. T. 2015. Decomposition bounds for marginal MAP. In Advances in Neural Information Processing Systems, 3267-3275.

Poole, D. 2003. First-order probabilistic inference. In IJCAI, volume 3, 985-991.

Richardson, M., and Domingos, P. 2006. Markov logic networks. Machine Learning 62(1):107-136.

Singla, P., and Domingos, P. M. 2008. Lifted first-order belief propagation. In AAAI, volume 8, 1094-1099.

Singla, P.; Nath, A.; and Domingos, P. M. 2014. Approximate lifting techniques for belief propagation. In AAAI, 2497-2504.
Van den Broeck, G., and Darwiche, A. 2013. On the complexity and approximation of binary evidence in lifted inference. In Advances in Neural Information Processing Systems, 2868-2876.

Venugopal, D., and Gogate, V. 2014. Evidence-based clustering for scalable inference in Markov logic. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 258-273. Springer.

Wainwright, M. J., and Jordan, M. I. 2008. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning 1(1-2):1-305.

Wainwright, M. J.; Jaakkola, T. S.; and Willsky, A. S. 2005. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory 51(7):2313-2335.