Lifted Hinge-Loss Markov Random Fields

Sriram Srinivasan (ssriniv9@ucsc.edu), UC Santa Cruz; Behrouz Babaki (Behrouz.Babaki@polymtl.ca), Polytechnique Montreal; Golnoosh Farnadi (gfarnadi@ucsc.edu), UC Santa Cruz; Lise Getoor (getoor@ucsc.edu), UC Santa Cruz

Abstract

Statistical relational learning models are powerful tools that combine ideas from first-order logic with probabilistic graphical models to represent complex dependencies. Despite their success in encoding large problems with a compact set of weighted rules, performing inference over these models is often challenging. In this paper, we show how to effectively combine two powerful ideas for scaling inference in large graphical models. The first idea, lifted inference, is a well-studied approach to speeding up inference in graphical models by exploiting symmetries in the underlying problem. The second idea is to frame maximum a posteriori (MAP) inference as a convex optimization problem and use the alternating direction method of multipliers (ADMM) to solve it in parallel. A well-studied relaxation of the combinatorial optimization problem defined for logical Markov random fields gives rise to a hinge-loss Markov random field (HL-MRF), for which MAP inference is a convex optimization problem. We show how the formalism introduced for coloring weighted bipartite graphs using a color refinement algorithm can be integrated with the ADMM optimization technique to take advantage of the sparse dependency structures of HL-MRFs. Our proposed approach, lifted hinge-loss Markov random fields (LHL-MRFs), preserves the structure of the original problem after lifting and solves lifted inference as distributed convex optimization with ADMM. In our empirical evaluation on real-world problems, we observe up to a three-fold speed-up in inference over HL-MRFs.

1 Introduction

Statistical relational learning (SRL) frameworks compactly specify a probability distribution over groups of objects using first-order logic. Most commonly, the probability distribution is defined as a template for a graphical model which is instantiated (or grounded) over the objects in the domain. A variety of different SRL frameworks have been developed over the last decade (see, e.g., De Raedt and Kersting 2011; Richardson and Domingos 2006; Getoor and Taskar 2007). In this paper, we focus on hinge-loss Markov random fields (HL-MRFs) (Bach et al. 2017), a recently introduced SRL framework based on weighted logical rules which makes inference tractable by defining a convex inference objective. HL-MRFs have been used successfully in a wide variety of domains including NLP tasks (Beltagy, Erk, and Mooney 2014; Wang and Ku 2016), image processing (Aditya, Yang, and Baral 2018; Gridach, Haddad, and Mulki 2017), bioinformatics (Sridhar, Fakhraei, and Getoor 2016), search (Alshukaili, Fernandes, and Paton 2016), recommender systems (Kouki et al. 2017; Lalithsena et al. 2017) and more (Deng and Wiebe 2015; Ebrahimi, Dou, and Lowd 2016; Chen, Chen, and Qian 2014), with promising results. For SRL frameworks, exact inference is often computationally expensive because inference is performed over large grounded graphical models.
However, this ground representation is typically derived from a much smaller set of logical rules and, depending on the data, often contains identical substructures. These identical substructures cause the inference algorithm to perform unnecessary work by repeating the same operations. Lifted inference (Kersting 2012; Kimmig, Mihalkova, and Getoor 2015; den Broeck et al. 2011; Kazemi and Poole 2016) aims to detect such common substructures and uses them to avoid redundant computations.

Lifted inference in SRL is a well-studied problem. A popular approach is to group objects that are indistinguishable given the evidence and perform inference by operating on these groups. First-order variable elimination (Poole 2003; de Salvo Braz, Amir, and Roth 2005) extends the standard variable elimination algorithm by summing over entire groups of random variables instead of one at a time. Lifted belief propagation (Singla and Domingos 2008) employs the same message-passing method as the standard belief propagation algorithm: it first groups variables into super nodes, which are connected via so-called super edges, and then performs message passing over the graph of super nodes and super edges. A modified version of belief propagation (BP) called counting BP (Kersting, Ahmadi, and Natarajan 2009) constructs a compressed factor graph by creating cluster nodes and cluster factors and uses a modified BP to perform inference. Some inference algorithms use the logical structure in a model for problem decomposition (Dechter and Mateescu 2007; Chavira and Darwiche 2008); the lifted versions of these algorithms perform this decomposition at the first-order level (Gogate and Domingos 2010; den Broeck et al. 2011).

The exact lifting methods discussed above assume that the variables in the problem of interest are discrete. This makes them inapplicable to languages such as PSL, which are defined over continuous random variables. Recently developed lifted linear programming (Mladenov, Ahmadi, and Kersting 2012; Mladenov, Kersting, and Globerson 2014) and lifted convex quadratic programming (Mladenov, Kleinhans, and Kersting 2017) offer a way to find and exploit symmetries in linear and quadratic programs. These methods group indistinguishable variables using the color refinement algorithm to produce a smaller linear or quadratic program, making inference faster.

The inference algorithm in HL-MRFs relies on the alternating direction method of multipliers (ADMM). ADMM is an iterative optimization method (Boyd et al. 2011) that provides an elegant approach for finding the saddle point of an augmented Lagrangian. The ADMM algorithm for HL-MRFs uses the structure of the objective function and solves the subproblems in each iteration using closed-form solutions. Our work integrates lifting via the color refinement algorithm with ADMM to perform more efficient inference in HL-MRFs. Using ADMM for HL-MRFs (Bach et al. 2017) shows exponential performance gains over traditional LP/QP solvers. To the best of our knowledge, this is the first approach that combines ADMM with color refinement to perform lifting for probabilistic inference.
Our contributions are as follows: 1) we propose the first method for detecting and eliminating the symmetries in HL-MRF inference problems using the color refinement algorithm; by applying this method to real-world datasets, we observe significant reductions (up to 66%) in problem size; 2) we show how the lifted problem can be cast back into the same form as the original inference problem and solved using the specialized inference algorithm of HL-MRFs. The proposed integration of lifted inference and ADMM is essential to our goal: we compare solving the lifted problem with existing off-the-shelf solvers and with the ADMM method, and demonstrate that lifting pays off more when the latter is employed; 3) we run a series of experiments on synthetically generated data, analyze the complicated relationships graph structures have with lifting, and show the effectiveness of LHL-MRFs at varying levels of symmetry.

2 Background

In this section, we review several key topics on which our proposed approach for lifted HL-MRFs (Section 3) relies. We begin by reviewing probabilistic modeling and templating languages for logical MRFs, in particular HL-MRFs and PSL. Next, we review the color refinement algorithm, which we use to perform lifting.

2.1 Markov Random Fields, HL-MRFs & PSL

Markov random fields are an expressive formalism for defining probability distributions. A number of recent SRL approaches, notably Markov logic (Richardson and Domingos 2006), use logic to define the potentials associated with a Markov random field. These languages translate weighted logical rules into potential functions, which are in turn used to define the Markov random field. We refer to these Markov random fields as logical Markov random fields.

Definition 1 (Markov random field). Let $\mathbf{y} = \{y_1, y_2, \ldots, y_n\}$ be a set of $n$ random variables and $\boldsymbol{\phi} = \{\phi_1, \phi_2, \ldots, \phi_m\}$ be $m$ potentials describing different logical relations between the variables, where $\phi_i(\mathbf{y})$ is a real-valued scalar representing the compliance of $\mathbf{y}$ with $\phi_i$. Also, let $\mathbf{w} = \{w_1, w_2, \ldots, w_m\}$ be real-valued weights associated with the potentials. Then a Markov random field is defined as:

$$P(\mathbf{y}) \propto \exp\big(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{y})\big)$$

A logical Markov random field is the same, with its potentials defined through logical statements and hence $\phi_i \in \{0, 1\}$.

The potentials of the MRF define how the domain behaves. For logical MRFs, these potentials are defined using logical statements. An expressive way of representing logical statements is as weighted rules, where each rule can be converted into clausal form (disjunctions of positive or negated literals). Every logical clause can be written as:

$$\bigvee_{k \in I^+} y_k \;\vee\; \bigvee_{j \in I^-} \neg y_j$$

where $I^+$ is the set of literals that participate in the clause unnegated and $I^-$ is the set of literals that participate in the clause with a negation (Bach et al. 2017). The most probable assignment to the variables can be found by computing the maximum a posteriori (MAP) estimate of the distribution, $\arg\max_{\mathbf{y} \in \{0,1\}^n} P(\mathbf{y})$:

$$\underset{\mathbf{y} \in \{0,1\}^n}{\arg\max} \; \sum_{i=1}^{m} w_i \min\Big\{ \sum_{k \in I_i^+} y_k + \sum_{j \in I_i^-} (1 - y_j),\; 1 \Big\} \qquad (2)$$

However, this is a combinatorial optimization problem: finding the assignment that maximizes the probability of binary random variables is equivalent to weighted MAX-SAT, a well-known NP-hard problem.

Hinge-loss MRFs. A hinge-loss Markov random field (HL-MRF) is a logical MRF in which the random variables are relaxed to take values in $[0, 1]^n$ instead of $\{0, 1\}^n$ as in logical MRFs. In order to convert a logical MRF to an HL-MRF, we first introduce definitions for logical statements over these continuous values.
Conjunction ($\tilde{\wedge}$), disjunction ($\tilde{\vee}$), and negation ($\tilde{\neg}$) are defined as $y_1 \,\tilde{\wedge}\, y_2 = \max\{y_1 + y_2 - 1, 0\}$, $y_1 \,\tilde{\vee}\, y_2 = \min\{y_1 + y_2, 1\}$, and $\tilde{\neg} y = 1 - y$. The tilde indicates the relaxation over Boolean values. With these relaxations, the objective function in Equation 2 can be written as:

$$\underset{\mathbf{y} \in [0,1]^n}{\arg\min} \; \sum_{i=1}^{m} w_i \max\{l_i, 0\} \qquad (3)$$

where $l_i = 1 - \sum_{k \in I_i^+} y_k - \sum_{j \in I_i^-} (1 - y_j) = \mathbf{y}^\top \mathbf{x}_i - c_i$, and $\mathbf{x}_i \in \{0, 1, -1\}^n$ is a vector that determines which variables participate in potential $i$: $x_{i,j} = 0$ means that variable $y_j$ does not participate, $x_{i,j} = 1$ means that $y_j \in I_i^-$ and is added, and $x_{i,j} = -1$ means that $y_j \in I_i^+$ and is subtracted. The constant $c_i$ associated with the potential is computed from the observed variables and the other constants in the expression.

Definition 2 (Hinge-loss energy function). Let $\mathbf{y} = \{y_1, y_2, \ldots, y_n\}$ be $n$ random variables, $\mathbf{l} = \{l_1, l_2, \ldots, l_m\}$ be $m$ linear constraints, and $\boldsymbol{\phi} = \{\phi_1, \phi_2, \ldots, \phi_m\}$ be $m$ potentials such that $\phi_i = (\max\{l_i, 0\})^{d_i}$, where $d_i \in \{1, 2\}$ provides a choice between two loss functions, $d_i = 1$ (linear) and $d_i = 2$ (quadratic). For weights $\mathbf{w} = \{w_1, w_2, \ldots, w_m\}$, a hinge-loss energy function is defined as:

$$f(\mathbf{y}) = \sum_{i=1}^{m} w_i \phi_i(\mathbf{y}) = \sum_{i=1}^{m} w_i \max\Big\{\sum_{j=1}^{n} x_{ij} y_j - c_i, \; 0\Big\}^{d_i} \qquad (4)$$

where $0 \le \mathbf{y} \le 1$. The HL-MRF is then defined as $P(\mathbf{y}) = \frac{1}{Z} \exp(-f(\mathbf{y}))$, where $Z = \int_{\mathbf{y}} \exp(-f(\mathbf{y}))\, d\mathbf{y}$ is a normalization factor.

A key advantage of using HL-MRFs is that, by using the continuous approximation of the logical statements, inference over the random variables turns from a combinatorial problem into a convex optimization problem. Additionally, this reformulation enables us to use the alternating direction method of multipliers (ADMM) (Boyd et al. 2011) to infer the variables. With ADMM, inference in HL-MRFs is scalable, which allows us to perform inference on real-world datasets. The first step in solving the problem with ADMM is to rewrite it in consensus form by introducing local copies $\mathbf{y}_{l,i}$ of the variables:

$$\min_{\mathbf{y}, \mathbf{y}_l} \; \sum_{i=1}^{m} w_i \phi_i(\mathbf{y}_{l,i}) + \sum_{j=1}^{n} \mathcal{X}_{[0,1]}[y_j] \quad \text{s.t. } \mathbf{y}_{l,i} = \mathbf{y} \;\; \forall i \in \{1, \ldots, m\} \qquad (5)$$

where $\mathcal{X}_{[0,1]}[y_j]$ is an indicator function that is zero if $y_j \in [0, 1]$ and infinity otherwise, and $\mathbf{y}_l$ is a matrix with $m$ rows and $n$ columns whose $i$-th row is $\mathbf{y}_{l,i}$. The augmented Lagrangian of this problem is:

$$L(\mathbf{y}, \mathbf{y}_l, \boldsymbol{\alpha}) = \sum_{i=1}^{m} w_i \phi_i(\mathbf{y}_{l,i}) + \sum_{j=1}^{n} \mathcal{X}_{[0,1]}[y_j] + \sum_{i=1}^{m} \boldsymbol{\alpha}_i^\top (\mathbf{y}_{l,i} - \mathbf{y}) + \frac{\rho}{2} \sum_{i=1}^{m} \|\mathbf{y}_{l,i} - \mathbf{y}\|_2^2$$

where $\rho > 0$ is the step size and $\boldsymbol{\alpha}$ is a matrix of the same dimensions as $\mathbf{y}_l$ holding the dual variables. The ADMM updates at iteration $t$ are:

$$\boldsymbol{\alpha}_i^{t} = \boldsymbol{\alpha}_i^{t-1} + \rho\big(\mathbf{y}_{l,i}^{t-1} - \mathbf{y}^{t-1}\big), \qquad \mathbf{y}_l^{t} = \underset{\mathbf{y}_l}{\arg\min}\; L(\mathbf{y}^{t-1}, \mathbf{y}_l, \boldsymbol{\alpha}^{t}), \qquad \mathbf{y}^{t} = \underset{\mathbf{y}}{\arg\min}\; L(\mathbf{y}, \mathbf{y}_l^{t}, \boldsymbol{\alpha}^{t})$$

These updates ensure that $\mathbf{y}$ converges to the MAP state.
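To make these updates concrete, the following is a minimal Python sketch of this consensus ADMM scheme for an energy of the form in Equation 4. It is illustrative only: the function name admm_map and its argument layout are hypothetical, the local copies are kept full-dimensional for simplicity (the actual PSL implementation restricts each copy to the variables in its potential's scope and uses further optimized closed-form steps), and a fixed iteration count stands in for a proper convergence check.

```python
# Minimal sketch (see lead-in); not the PSL library's implementation.
import numpy as np

def admm_map(X, c, w, d, n_iter=500, rho=1.0):
    """X: (m, n) coefficient matrix; c, w: (m,) constants and weights;
    d: (m,) exponents in {1, 2}. Returns an approximate MAP state y in [0,1]^n.
    Assumes every potential touches at least one variable (nonzero row of X)."""
    m, n = X.shape
    y = np.full(n, 0.5)                 # consensus variables
    Z = np.tile(y, (m, 1))              # local copies, one row per potential
    A = np.zeros((m, n))                # dual variables alpha_i
    sq_norms = (X ** 2).sum(axis=1)

    for _ in range(n_iter):
        # dual update: alpha_i <- alpha_i + rho * (z_i - y)
        A += rho * (Z - y)
        # targets of the local proximal problems: v_i = y - alpha_i / rho
        V = y - A / rho
        for i in range(m):
            v, x = V[i], X[i]
            slack = x @ v - c[i]
            if slack <= 0:                       # hinge inactive at v: z_i = v
                Z[i] = v
            elif d[i] == 2:                      # squared hinge: closed form
                s = slack / (1.0 + 2.0 * w[i] * sq_norms[i] / rho)
                Z[i] = v - (2.0 * w[i] * s / rho) * x
            else:                                # linear hinge
                z = v - (w[i] / rho) * x
                if x @ z - c[i] < 0:             # crossed the hinge: project onto it
                    z = v - (slack / sq_norms[i]) * x
                Z[i] = z
        # consensus update with box projection onto [0,1]
        y = np.clip((Z + A / rho).mean(axis=0), 0.0, 1.0)
    return y
```

Each local step solves $\min_{\mathbf{z}} \; w_i \max\{\mathbf{x}_i^\top \mathbf{z} - c_i, 0\}^{d_i} + \frac{\rho}{2}\|\mathbf{z} - \mathbf{v}\|_2^2$ with $\mathbf{v} = \mathbf{y} - \boldsymbol{\alpha}_i/\rho$, using the closed forms in the three branches.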
Probabilistic Soft Logic. Probabilistic soft logic (PSL) is a declarative language for specifying HL-MRFs. A PSL model consists of a set of weighted logical rules, e.g., Horn clauses of the form

w_r : B_1 ∧ ... ∧ B_m → H

where the B_i and H are predicates or negated predicates. We ground a rule r by replacing its variables with constants from the data. Each $w_r \in \mathbb{R}^+ \cup \{\infty\}$ is the weight of rule $r$, and each ground predicate $x$ is coupled with an interpretation $I(x) \in [0, 1]$ representing its truth value. Using the relaxation of the logical operators, we define the notion of distance to satisfaction for each ground rule in a PSL model, which for the Horn clause above equals $I(\bigwedge_{i=1}^{m} B_i) - I(H)$. A grounded PSL model induces an HL-MRF in which the distance to satisfaction of each grounded rule forms a hinge-loss potential as in Equation 4.

Example 1. Consider the following PSL model with one rule that represents a transitive Knows relationship among people:

w_1 : Knows(P1, P2) ∧ Knows(P2, P3) → Knows(P1, P3)

Assume that our data contains three individuals, Bob, Dan, and Elsa, and that the weight w_1 = 5. Given the observations $I(\text{Knows}(\text{Bob}, \text{Elsa})) = 1$ and $I(\text{Knows}(\text{Elsa}, \text{Dan})) = 1$, and assuming that $I(\text{Knows}(X, X)) = 0$ for every individual $X$, our aim is to infer truth values for the remaining atoms. The grounded model consists of four atoms that participate in four grounded rules. Denoting the unknown truth values by variables $y_1, \ldots, y_4$, the hinge-loss energy function is:

$$f(\mathbf{y}) = 5\max(y_1 - y_2, 0)^2 + 5\max(-y_1 + y_2 + y_4 - 1, 0)^2 + 5\max(y_1 - y_4, 0)^2 + 5\max(-y_3 + 1, 0)^2$$

2.2 Color refinement

Color refinement is a simple algorithm for identifying similar nodes in a graph (Ramana, Scheinerman, and Ullman 1994). It has efficient implementations that run in quasilinear time (Codenotti et al. 2013) and has been used in practical graph isomorphism tools and for lifted inference (Grohe et al. 2017). Color refinement is an iterative algorithm that assigns colors to nodes in a sequence of refinement rounds. For a graph $G = (V, E)$, it first initializes all nodes in $V$ with the same color. In every refinement round, any two nodes $v, w \in V$ with the same color are reassigned to different colors if there is some color $c$ such that $v$ and $w$ have a different number of neighbors with color $c$; otherwise no change is made. The refinement stops when the colors of all nodes before and after a refinement round remain the same. This state of the graph, in which the colors of nodes do not change across refinement rounds, is called a stable coloring of the graph. Let $A$ be the adjacency matrix of $G$. Then the nodes $u$ and $v$ have the same color in the stable coloring of $G$ iff for every color class $C$ it holds that $\sum_{w \in C} A_{uw} = \sum_{w \in C} A_{vw}$. The color refinement algorithm can be generalized to weighted graphs by refining the colors based on the weighted sum of the edges instead of the degree. This generalization can in turn be extended to weighted bipartite graphs, where each part is initialized with a different color. In a stable bi-coloring of a weighted bipartite graph, the condition above holds for the weighted adjacency matrix of $G$.
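The weighted bipartite generalization is the variant used for lifting in Section 3, and it is simple to implement directly. The sketch below is an unoptimized illustration with hypothetical function and argument names; the experiments in this paper use the Saucy implementation from the RELOOP library rather than anything like this.

```python
# Minimal sketch of weighted color refinement (see lead-in for assumptions).
from collections import defaultdict

def color_refinement(edges, init_colors):
    """edges: dict mapping every node (both sides of the bipartite graph) to a
    list of (neighbor, weight) pairs; init_colors: dict mapping node -> initial
    color (any hashable value). Returns a stable coloring as a dict node -> color id."""
    # normalize initial colors to integer ids so signatures stay comparable
    ids = {c: i for i, c in enumerate(sorted(set(init_colors.values()), key=repr))}
    colors = {node: ids[c] for node, c in init_colors.items()}
    while True:
        signatures = {}
        for node, nbrs in edges.items():
            # weighted sum of edges into each neighboring color class
            per_color = defaultdict(float)
            for nbr, w in nbrs:
                per_color[colors[nbr]] += w
            signatures[node] = (colors[node], tuple(sorted(per_color.items())))
        # relabel the signatures with compact integer color ids
        new_ids = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        new_colors = {node: new_ids[sig] for node, sig in signatures.items()}
        # a round that splits no class leaves the number of classes unchanged
        if len(set(new_colors.values())) == len(set(colors.values())):
            return new_colors
        colors = new_colors
```

Section 3.1 initializes factor nodes by their $(w_i, c_i, d_i)$ labels and gives all variable nodes a single shared color; running this procedure on the factor graph of Example 1 with that initialization should reproduce the variable partition reported later in Example 3.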
3 Lifted Hinge-Loss Markov Random Fields

Graphical models generated from logical templates can manifest varying degrees of symmetry. In this section, we introduce a method based on the color refinement algorithm to find and eliminate such symmetries in an HL-MRF energy function. The function obtained by this method is also an HL-MRF energy function. Preserving this form is crucial, since the efficient ADMM method introduced earlier is tailored to functions of this form. At the end of this section, we show that we can obtain the solution of the original problem by solving the lifted problem and mapping its solution back to the original space.

[Figure 1: The factor graph of the HL-MRF model presented in Example 1: (a) factor graph with initial coloring, (b) colored factor graph, (c) lifted factor graph. The labels of the factor nodes appear on their right side. Edge weights are represented by line style (solid: 1, dashed: -1, thick: 2).]

3.1 Lifted HL-MRFs (LHL-MRFs)

Our lifting method operates on a factor graph, which is a graphical representation of an HL-MRF energy function.

Definition 3 (Factor graph). The factor graph of an HL-MRF energy function is a graph $G = (U, V, E)$ in which there is a node $u_j \in U$ for each variable $y_j$ ($j = 1, \ldots, n$) and a node $v_i \in V$ for each potential $\phi_i$ ($i = 1, \ldots, m$). For each nonzero coefficient $x_{ij}$ of variable $y_j$ in potential $\phi_i$ there is an edge $e_{ij} \in E$ between $u_j$ and $v_i$ with weight $x_{ij}$. Each node $v_i \in V$ is labeled by the tuple $(w_i, c_i, d_i)$.

Example 2. The energy function of the HL-MRF in Example 1 can be represented by the factor graph in Fig. 1a.

We now describe a method that, given the factor graph $G$ of an energy function $f$, produces a potentially smaller factor graph $G'$. Instead of solving the MAP inference problem for $f$, one can solve the MAP inference problem for the function $f'$ represented by $G'$ and map the solution back to the variables in $f$. We first assign initial colors to the nodes of $G = (U, V, E)$. The nodes in $V$ receive different colors based on their labels: two nodes with labels $(w_1, c_1, d_1)$ and $(w_2, c_2, d_2)$ receive the same initial color iff $c_1 = c_2$, $w_1 = w_2$, and $d_1 = d_2$. All nodes in $U$ receive the same color, which is different from the colors of the nodes in $V$. We then run the color refinement algorithm on $G$, which outputs a stable bi-coloring $C^U_1, \ldots, C^U_p$ for the nodes in $U$ and $C^V_1, \ldots, C^V_q$ for the nodes in $V$. To create the lifted factor graph $G' = (U', V', E')$, we first create a lifted variable node $u'_k$ for every color class $C^U_k$ and a lifted factor node $v'_l$ for every color class $C^V_l$. Each pair of a lifted variable node $u'_k$ and a lifted factor node $v'_l$ corresponds to a set of edges in $G$, namely $E_{kl} = \{e_{ij} \in E : v_i \in C^V_l, u_j \in C^U_k\}$. If $E_{kl}$ is non-empty, we connect the nodes $u'_k$ and $v'_l$ in $G'$ by an edge with weight $\big(\sum_{(i,j) : e_{ij} \in E_{kl}} x_{ij}\big) / |C^V_l|$. Let $I = \{i : v_i \in C^V_l\}$ and let $(w, c, d)$ be the label of some $v \in C^V_l$. We label the node $v'_l \in V'$ by the tuple $(\sum_{i \in I} w_i, c, d)$.

Example 3. The output coloring of the color refinement algorithm is shown in Fig. 1b. According to this coloring, the variables are partitioned into the sets $\{y_1\}, \{y_2, y_4\}, \{y_3\}$ and the factors are partitioned into the sets $\{\phi_2\}, \{\phi_1, \phi_3\}, \{\phi_4\}$. From these color classes we obtain the lifted factor graph of Fig. 1c, which represents the function:

$$f'(\mathbf{y}') = 5\max(-y'_1 + 2y'_2 - 1, 0)^2 + 10\max(y'_1 - y'_2, 0)^2 + 5\max(-y'_3 + 1, 0)^2$$

As noted before, the expression in the above example is a weighted sum of hinge functions which has exactly the same form as (4). This implies that the function represented by the lifted factor graph can also be optimized using ADMM. To map the solution of the lifted problem back to the original space, we only need to assign the value of the representative variable of each lifted color class to all the variables in that color class.
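As a rough illustration of this construction and of the mapping back to the original space (a sketch, not the implementation used in our experiments), the following builds the lifted coefficients, constants, weights, and exponents from a stable bi-coloring and expands a lifted solution. The function names and the dense matrix layout are hypothetical; the code assumes, as guaranteed by the initial coloring above, that all potentials in a factor class share the same $(c, d)$.

```python
# Minimal sketch of constructing the lifted energy from a stable bi-coloring.
import numpy as np

def lift_energy(X, c, w, d, var_classes, factor_classes):
    """X: (m, n) coefficients; c, w, d: (m,) per-potential label entries.
    var_classes / factor_classes: lists of index lists, one per color class.
    Returns lifted (X', c', w', d')."""
    p, q = len(var_classes), len(factor_classes)
    X_l = np.zeros((q, p))
    c_l, w_l, d_l = np.zeros(q), np.zeros(q), np.zeros(q, dtype=int)
    for l, fc in enumerate(factor_classes):
        w_l[l] = w[fc].sum()                 # weights are summed over the class
        c_l[l], d_l[l] = c[fc[0]], d[fc[0]]  # c and d are shared within the class
        for k, vc in enumerate(var_classes):
            # aggregate edge weights between the two classes, divided by |C^V_l|
            X_l[l, k] = X[np.ix_(fc, vc)].sum() / len(fc)
    return X_l, c_l, w_l, d_l

def expand_solution(y_lifted, var_classes, n):
    """Map the lifted MAP state back: every variable gets its class value."""
    y = np.zeros(n)
    for k, vc in enumerate(var_classes):
        y[vc] = y_lifted[k]
    return y

# Example 1 from the paper: four potentials over y1..y4, all with w = 5, d = 2.
X = np.array([[ 1, -1,  0,  0],   # 5*max(y1 - y2, 0)^2
              [-1,  1,  0,  1],   # 5*max(-y1 + y2 + y4 - 1, 0)^2
              [ 1,  0,  0, -1],   # 5*max(y1 - y4, 0)^2
              [ 0,  0, -1,  0]])  # 5*max(-y3 + 1, 0)^2
c = np.array([0.0, 1.0, 0.0, -1.0])
w = np.array([5.0, 5.0, 5.0, 5.0])
d = np.array([2, 2, 2, 2])
var_classes = [[0], [1, 3], [2]]        # {y1}, {y2, y4}, {y3}
factor_classes = [[1], [0, 2], [3]]     # {phi2}, {phi1, phi3}, {phi4}
X_l, c_l, w_l, d_l = lift_energy(X, c, w, d, var_classes, factor_classes)
# The lifted quantities reproduce the energy of Example 3:
# 5*max(-y'1 + 2y'2 - 1, 0)^2 + 10*max(y'1 - y'2, 0)^2 + 5*max(-y'3 + 1, 0)^2
```

Feeding the lifted quantities to the admm_map sketch given earlier and then calling expand_solution should recover the MAP state of the original energy; the next subsection establishes why this equivalence holds.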
3.2 Correctness of the method

We now show that optimizing the lifted function produces the same objective value as optimizing the original function, and that the optimal values of the variables in the original problem can be derived from their lifted counterparts. Our proof is based on an existing procedure for lifting quadratic programming (QP) problems (Mladenov, Kleinhans, and Kersting 2017). We show that the MAP inference problem in HL-MRFs can be cast as a QP problem and that lifting this problem produces another QP which is equivalent to the function produced by our lifting method.

To write the objective function of MAP inference in Equation 4 as a QP, we replace the max functions by constraints over auxiliary variables $\psi_i$:

$$\min_{\mathbf{y}, \boldsymbol{\psi}} \; \sum_{i} w_i \psi_i^{d_i} \quad \text{s.t. } \psi_i \geq \sum_{j} x_{ij} y_j - c_i \;\; \forall i, \qquad \mathbf{y}, \boldsymbol{\psi} \geq 0 \qquad (6)$$

QP problems are lifted by performing the color refinement algorithm on a graph called the coefficient graph. We now explain how to construct the coefficient graph for (6) (for further details we refer to Mladenov, Kleinhans, and Kersting (2017)). The coefficient graph of (6) is the 4-tuple $(U, V, Z, E)$, where the nodes $u_j \in U$, $v_i \in V$, and $z_i \in Z$ correspond to variable $y_j$, constraint $i$, and variable $\psi_i$, respectively. For each nonzero coefficient $x_{ij}$ there is an edge with weight $x_{ij}$ between the nodes $u_j$ and $v_i$. For each constraint $i$, the nodes $v_i \in V$ and $z_i \in Z$ are connected by an edge with weight $c_i$, and if $d_i = 2$ then there is a self-loop edge on $z_i$ with weight $w_i$. Initially, all nodes in $U$ receive the same color, which is not shared by any node in $Z$; likewise, all nodes in $Z$ share a single color. Two nodes $v_{i_1}, v_{i_2} \in V$ receive the same initial color iff $c_{i_1} = c_{i_2}$. After performing the color refinement algorithm on this coefficient graph, the coefficient graph of the lifted QP problem is constructed by grouping the variables and constraints of each color class together. The edge weights are aggregated in the same way as previously described for our method. The optimal value of a variable in the original QP is equal to the optimal value of its lifted counterpart.

[Figure 2: The colored coefficient graph of the HL-MRF model presented in Example 1. Edge weights are represented by line style (solid: 1, dashed: -1, dotted: 5).]

Example 4. The coefficient graph of the QP corresponding to Example 1, together with the coloring assigned to it by the color refinement algorithm, is presented in Fig. 2.

Assume that a function $f$ is lifted to another function $f'$ using our proposed method. We demonstrate the correctness of our method by showing that the QP of $f$ can be lifted to the QP of $f'$.

Theorem 1. Let $G = (U^G, V^G, E^G)$ be the factor graph of an energy function of an HL-MRF, and let $Q = (U^Q, V^Q, Z^Q, E^Q)$ be the coefficient graph of its QP. Then in the stable bi-colorings of $G$ and $Q$, the color classes of $U$ and $V$ are the same.

Proof. Assume that in the stable bi-coloring $C^G$ of the factor graph $G$, the nodes are partitioned into disjoint color classes $C^G_1, \ldots, C^G_q \subseteq V^G$ and $C^G_{q+1}, \ldots, C^G_{q+p} \subseteq U^G$. Let us denote by $u^G_j$ and $v^G_i$ the nodes in $G$ corresponding to variable $y_j$ and factor $\phi_i$ in the HL-MRF energy function. Also let $u^Q_j$, $v^Q_i$, and $z^Q_i$ denote the nodes corresponding to variable $y_j$, constraint $i$, and auxiliary variable $\psi_i$ in the corresponding QP problem. We will construct a stable bi-coloring $C^Q$ of the coefficient graph $Q$ with the following properties: 1) variable nodes have the same color classes in $C^Q$ as in $C^G$, and constraint nodes in $C^Q$ have the same color classes as factor nodes in $C^G$; 2) $C^Q$ is consistent with the initial coloring of $Q$; and 3) $C^Q$ is the coarsest stable bi-coloring of the graph $Q$. We first construct the coloring $C^Q$ and then show how the above conditions hold for it. Let $C(u)$ denote the color class of $u$ in the coloring $C$. In $C^Q$ we assign the colors to the nodes in $U^Q$, $V^Q$, and $Z^Q$ based on the color classes of $U^G$ and $V^G$ in $C^G$ as follows:

$$u^Q_{j_1} \in C^Q(u^Q_{j_2}) \iff u^G_{j_1} \in C^G(u^G_{j_2}) \qquad (7)$$
$$v^Q_{i_1} \in C^Q(v^Q_{i_2}) \iff v^G_{i_1} \in C^G(v^G_{i_2}) \qquad (8)$$
$$z^Q_{i_1} \in C^Q(z^Q_{i_2}) \iff v^G_{i_1} \in C^G(v^G_{i_2}) \qquad (9)$$

The first property holds by definition.
By definition, the nodes $u^Q_{j_1}, u^Q_{j_2} \in U^Q$ have the same initial color iff the initial colors of the nodes $u^G_{j_1}, u^G_{j_2} \in U^G$ are the same. Similarly, $v^Q_{i_1}, v^Q_{i_2} \in V^Q$ receive the same initial color iff the nodes $v^G_{i_1}, v^G_{i_2} \in V^G$ have the same initial color. Additionally, all nodes in $Z^Q$ receive the same initial color. Hence $C^Q$ is consistent with the initial coloring of $Q$.

To show that the coloring is stable, we need to show that the sum of the weights of the edges connecting to the nodes of each color class is the same among all nodes having the same color. For each pair of variable nodes $u^Q_{j_1}, u^Q_{j_2} \in U^Q$ and color class $C^Q_l$ it should hold that

$$u^Q_{j_1} \in C^Q(u^Q_{j_2}) \implies \sum_{i : v^Q_i \in C^Q_l} x_{ij_1} = \sum_{i : v^Q_i \in C^Q_l} x_{ij_2}.$$

Since $C^G$ is a stable coloring of $G$, we have $u^G_{j_1} \in C^G(u^G_{j_2}) \implies \sum_{i : v^G_i \in C^G_l} x_{ij_1} = \sum_{i : v^G_i \in C^G_l} x_{ij_2}$, which together with Equation 7 proves this property. Similarly, for each pair of constraints $i_1, i_2$ and color class $C^Q_k$ it should hold that $v^Q_{i_1} \in C^Q(v^Q_{i_2}) \implies \sum_{j : u^Q_j \in C^Q_k} x_{i_1 j} = \sum_{j : u^Q_j \in C^Q_k} x_{i_2 j}$, which follows from Equation 8 and the fact that $v^G_{i_1} \in C^G(v^G_{i_2}) \implies \sum_{j : u^G_j \in C^G_k} x_{i_1 j} = \sum_{j : u^G_j \in C^G_k} x_{i_2 j}$. Note that the weights of the edges connecting to the nodes of the $\psi_i$ are not included in these sums, since the $\psi_i$ variables appear with the same coefficient in all constraints. For a pair of nodes $z^Q_{i_1}, z^Q_{i_2} \in Z^Q$ with $d_{i_1} = d_{i_2} = 1$, we need $z^Q_{i_1} \in C^Q(z^Q_{i_2}) \implies v^Q_{i_1} \in C^Q(v^Q_{i_2})$, which trivially holds by Equation 9. Finally, when $d_{i_1} = d_{i_2} = 2$, it should hold that $z^Q_{i_1} \in C^Q(z^Q_{i_2}) \implies v^Q_{i_1} \in C^Q(v^Q_{i_2})$ and $w_{i_1} = w_{i_2}$, which holds by Equation 9 and the fact that if $w_{i_1} \neq w_{i_2}$ then the nodes $v^G_{i_1}, v^G_{i_2} \in V^G$ are initialized with different colors.

Now what remains is to show that $C^Q$ is the coarsest stable coloring of the graph $Q$, i.e., that there is no other stable bi-coloring respecting the previous conditions that assigns fewer colors to the nodes of $Q$ than $C^Q$. Assume that there is a stable bi-coloring $C'^Q$ of $Q$ with fewer colors than $C^Q$. Then we can construct a stable bi-coloring $C'^G$ for the factor graph $G$ that respects its initial coloring, by partitioning $U^G$ and $V^G$ according to the color classes of $U^Q$ and $V^Q$ in $C'^Q$. Since the partitions of $V^Q$ and $Z^Q$ are in one-to-one correspondence, the reduction in the number of color classes in $C'^Q$ cannot be limited to the color classes of the nodes in $Z^Q$. This means that $C'^G$ has fewer color classes than $C^G$, which is a contradiction.

4 Empirical Evaluation

In this section, we evaluate our proposed lifted inference algorithm, LHL-MRF, on various real and synthetic datasets. We investigate three research questions in our experiments:
Q1: How does lifting affect performance on real-world datasets?
Q2: How does the graph structure influence the impact of lifting?
Q3: How much symmetry is required for lifting HL-MRFs to be effective?
All experiments were run on a machine with 16GB of RAM and an i5 processor. The implementations are all single-threaded. We implemented our models using the PSL open-source Java library (https://github.com/linqs/psl). We ground the rules using the PSL library and then run inference using our own implementation of ADMM in C++ (https://github.com/linqs/srinivasan-aaai19). Note that PSL removes a large number of trivial symmetries during the grounding process by removing trivially satisfied rules (for further information see Augustine and Getoor (2018)). Removing these simple symmetries ensures that the extra symmetries obtained by our approach are non-trivial. We use Saucy (http://vlsicad.eecs.umich.edu/BK/SAUCY) from the RELOOP library to perform color refinement (Mladenov et al. 2016).
Experiments on Real-world Data

We selected three real-world datasets from different domains for which the corresponding PSL models have been used with promising results.
- Citeseer: This dataset includes 3312 papers in six categories and 4591 citation links. The goal is to classify documents in a citation network. The original data comes from CiteSeer. The details about the model and data can be found in Bach et al. (2017).
- Cora: This dataset includes 2708 papers in seven categories and 5429 citation links. The goal is to classify documents in a citation network. The original data comes from Cora. The details about the model and data can be found in Bach et al. (2017).
- Wikidata: This dataset contains 419 families and 1,844 family trees. The goal is to perform entity resolution on a family graph obtained from Wikidata by crawling the site for familial relations. The details about the model and data can be found in Kouki et al. (2017).

To address Q1, we measure the effects of lifting on the three datasets. Figure 3 presents the number of variables and potentials of these datasets before and after lifting. We observe a varying amount of symmetry in these datasets: the reduction in the number of variables and potentials is about 20% on Wikidata, 46% on Cora, and 66% on the Citeseer dataset.

[Figure 3: Number of potentials + variables before and after lifting on Citeseer, Cora, and Wikidata. The number of variables and potentials is reduced by differing amounts after lifting on the real-world datasets.]

Table 1 shows the time to solve the original problem, i.e., HL-MRF; the time to solve the lifted problem, i.e., LHL-MRF (solving); the time to lift the HL-MRF with the color refinement algorithm, i.e., LHL-MRF (lifting); and the end-to-end inference time for the lifted approach, i.e., LHL-MRF (total), or LHL-MRF for short.

Table 1: Time taken to perform inference on the different datasets (all times in seconds).

Dataset    HL-MRF   LHL-MRF (solving)   LHL-MRF (lifting)   LHL-MRF (total)
Citeseer   57.4     19.8                0.39                20.19
Cora       47.7     17.5                0.53                18.03
Wikidata   636.0    463.7               112.7               576.4

As expected, due to the large reduction in the number of variables and potentials, there is a significant difference between the time taken by HL-MRF and LHL-MRF (solving) on all datasets. Even with the modest 20% reduction in the number of variables and potentials on Wikidata, LHL-MRF (solving) is 27% faster than HL-MRF, and owing to the much larger reductions on the other datasets, we see three-fold speed-ups on both the Cora and the Citeseer datasets. However, the lifting time (LHL-MRF (lifting)) must also be considered. After accounting for it, we still see that LHL-MRF is about 10% faster on the Wikidata dataset and almost three times faster on the Cora and Citeseer datasets when compared to HL-MRF.

Experiments on Synthetic Data

To address Q2 and Q3 and better understand how symmetry is affected by graph density, we generate five different synthetic graphs. We also generate three sets of possible continuous values that the edges of the graph can take, to produce different neighborhood structures in the graph. We generate the graphs for the task of node labeling with varying levels of density.
We used a PSL model for the commonly used smokers example introduced in Richardson and Domingos (2006), which describes smoking behavior among friends with the following rules:

1.0 : Friend(A, B) ∧ Smokes(A) → Smokes(B)
1.0 : Friend(A, B) ∧ ¬Smokes(A) → ¬Smokes(B)

This model states that if two people are friends, then either both of them smoke or neither of them does. We fix the number of users to 1000 and randomly create friendship links between these users, varying the number of edges from 10k to 500k. In practice friendship is not necessarily a black-and-white matter, i.e., people can be friends to varying degrees. Hence, we consider three cases for the values of the friendship links: 1) binary values, 2) values between zero and one with one decimal point, and 3) values between zero and one with four decimal points. This means that the friendship links can take only two values in the first case ({0, 1}), 10 values in the second ({0.0, 0.1, ..., 1.0}), and 10,000 in the third ({0.0000, 0.0001, ..., 1.0000}). We randomly assign a label to each user and keep 50% of the labels as evidence and the other 50% as unknowns to be inferred.

[Figure 4: Comparison of inference times and problem size for HL-MRF and LHL-MRF as graph density varies from 10k to 500k edges. (a) For binary values, as the graph density increases, the total amount of lifting (number of potentials + variables before vs. after lifting) increases. (b) For varying numbers of values (binary, 1 decimal point, 4 decimal points), the ratio of the full to the lifted problem size varies with graph density. (c) Inference times for HL-MRF, LHL-MRF (solving), LHL-MRF (lifting), and LHL-MRF as graph density varies.]

Figure 4a shows the total number of variables and potentials before and after lifting for the binary case. Figure 4b shows the ratio between the number of variables and potentials before and after lifting for the varying value ranges. We see that for the binary case, the amount of lifting is maximized and the ratio increases as the graph density increases. However, as the value range increases, the amount of lifting drops significantly, and eventually there are no symmetries to be exploited. Finally, Figure 4c presents the processing time to solve the binary case. The results indicate that using LHL-MRF gives a significant performance improvement over HL-MRF as the graph becomes denser. These results imply that there are complex trade-offs between the structure of the graph and the range of the values in the data. We utilize exact lifting, and therefore LHL-MRF performs well at finding symmetries in datasets with denser structure and a smaller range of values.

Finally, to further understand how the amount of symmetry affects the overall inference time in a slightly more complex and realistic (yet still synthetic) setting, we study the social affiliation dataset and the PSL model used by Bach et al. (2017) for scalability analysis. We use a dataset that contains 22k nodes and 130k edges (https://github.com/stephenbach/admm-speed-test). We begin by lifting this dataset to remove all symmetry (the original dataset has less than 1% symmetry). To induce symmetry, we systematically inject the same structure into the data. This is done by duplicating every grounded rule and shuffling the data.
We duplicate the grounded rules up to 10 times, creating 10 subsets (named 1x, 2x, 3x, ..., 10x), where the 10x subset has ten times as many potentials as the original data (the 1x subset), created by duplication.

[Figure 5: Inference time (in seconds) for HL-MRF, LHL-MRF (solving), LHL-MRF (lifting), and LHL-MRF on the 1x-10x subsets. As symmetry increases, the gap between the time to solve HL-MRF and LHL-MRF increases.]

Figure 5 shows the results for HL-MRF and LHL-MRF (split into LHL-MRF (solving), LHL-MRF (lifting), and LHL-MRF (total)) on all 10 subsets. For the smallest dataset (which contains no symmetry), the time required by LHL-MRF is higher than that of HL-MRF due to the time taken to perform lifting. However, as the size of the dataset increases from 2x to 10x, the time taken by LHL-MRF to solve the problem becomes much lower than that of HL-MRF. Notably, as the symmetry increases, the gap between solving the HL-MRF and LHL-MRF problems widens. Note that the solving time of LHL-MRF is the same for all 10 subsets (equal to that of the 1x dataset), which appears as the flat line for LHL-MRF (solving) in Figure 5.

For the sake of completeness and to compare against other lifted inference methods, in Figure 6 we compare the performance of HL-MRF and LHL-MRF using ADMM with versions that use Gurobi, an off-the-shelf commercial QP solver. We denote these methods HL-MRF (Gurobi) and LHL-MRF (Gurobi).

[Figure 6: Log of time (in seconds) for HL-MRF, HL-MRF (Gurobi), LHL-MRF, and LHL-MRF (Gurobi) on the 1x-10x subsets. As symmetry increases, the gap between the time to solve HL-MRF and HL-MRF (Gurobi) increases exponentially. The difference between LHL-MRF and LHL-MRF (Gurobi) remains the same as on the 1x dataset.]

For all 10 subsets, we observe that HL-MRF and LHL-MRF consistently and significantly outperform HL-MRF (Gurobi) and LHL-MRF (Gurobi), respectively. We also see that this difference increases as the size of the data increases. Note that the time taken to solve using LHL-MRF (Gurobi) is similar to other lifting methods such as belief propagation: for most lifting methods, the time complexity grows cubically with the number of variables in the data. However, our approach is unique and desirable in that it maintains the original form of the function, allowing us to use ADMM, which is known to be much more scalable than other approaches (Forouzan and Ihler 2013). To the best of our knowledge, the datasets used in other lifted inference papers are on the order of thousands of variables and potentials, whereas using ADMM in our approach allows us to easily scale to problems with millions of variables and potentials.

5 Discussion and Future Directions

We have shown that there are significant opportunities for lifted inference, even in the case of continuous-valued variables defined by HL-MRFs. Through empirical evaluation on real datasets, we show that the inference task for HL-MRF models can run up to three times faster by using LHL-MRFs. However, it is important to note that LHL-MRFs cannot guarantee speed-ups for all types of problems. We investigate the effects of graph density and the range of real values on lifting in HL-MRFs; studying the characteristics of the optimization problem after lifting is left for future work.
We also notice that on small problems, in which inference in HL-MRFs takes less than one second, the overhead of lifting is noticeable, and therefore even with a large reduction in the number of variables and potentials we cannot necessarily reduce the solving time of LHL-MRFs.

This work suggests other interesting directions for future work. First, in this work we only exploit exact symmetries, which may be hard to find in some applications. Previous work indicates that approximate lifted inference can improve performance without compromising other metrics like precision (Sen, Deshpande, and Getoor 2009). In our setting, approximate lifting could also lead to a greater reduction in the number of variables and speed up the task of inference. Second, two of the most challenging tasks in MRFs are learning the weights and the structure of the logical rules from data. Structure learning and weight learning are often performed using a scoring function that iteratively uses a MAP state. An interesting path to explore is to employ lifted inference to make such systems more efficient.

6 Conclusion

In this paper, we introduced LHL-MRF, a novel approach to lifted inference in HL-MRFs. LHL-MRF marries lifted inference via the color refinement algorithm of Grohe et al. (2017) with the convex inference approach proposed by Bach et al. (2017) to solve large-scale graphical models described by HL-MRFs. By combining these two ideas, our method is able to reduce the number of variables and potentials in a model and perform inference efficiently on a significantly smaller optimization problem. Through empirical evaluation, we show that the inference task for HL-MRF models on relatively small real-world problems can be made to run three times faster. Further, in our experiments, we investigated how varying symmetry affects the performance of LHL-MRF, and we explored the impact of both structure and domain values on the efficiency of LHL-MRF.

Acknowledgements

We would like to thank Kristian Kersting for his constructive feedback and the anonymous reviewers for their helpful suggestions. This work is supported by the National Science Foundation under grant numbers CCF-1740850 and IIS-1703331, and by the US Army Corps of Engineers Research and Development Center under contract number W912HZ17-P-0101. Behrouz Babaki is supported by a postdoctoral scholarship from IVADO through the Canada First Research Excellence Fund (CFREF) grant.

References

Aditya, S.; Yang, Y.; and Baral, C. 2018. Explicit reasoning over end-to-end neural architectures for visual question answering. In AAAI.
Alshukaili, D.; Fernandes, A. A. A.; and Paton, N. W. 2016. Structuring linked data search results using probabilistic soft logic. In ISWC.
Augustine, E., and Getoor, L. 2018. A comparison of bottom-up approaches to grounding for templated Markov random fields. In SysML.
Bach, S. H.; Broecheler, M.; Huang, B.; and Getoor, L. 2017. Hinge-loss Markov random fields and probabilistic soft logic. Journal of Machine Learning Research 18:109:1-109:67.
Beltagy, I.; Erk, K.; and Mooney, R. J. 2014. Probabilistic soft logic for semantic textual similarity. In ACL.
Boyd, S. P.; Parikh, N.; Chu, E.; Peleato, B.; and Eckstein, J. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning.
Chavira, M., and Darwiche, A. 2008. On probabilistic inference by weighted model counting. Artificial Intelligence 172(6-7):772-799.
Chen, P.; Chen, F.; and Qian, Z. 2014. Road traffic congestion monitoring in social media with hinge-loss Markov random fields. In ICDM.
Codenotti, P.; Katebi, H.; Sakallah, K. A.; and Markov, I. L. 2013. Conflict analysis and branching heuristics in the search for graph automorphisms. In ICTAI.
De Raedt, L., and Kersting, K. 2011. Statistical relational learning. In Encyclopedia of Machine Learning. Springer.
de Salvo Braz, R.; Amir, E.; and Roth, D. 2005. Lifted first-order probabilistic inference. In IJCAI.
Dechter, R., and Mateescu, R. 2007. AND/OR search spaces for graphical models. Artificial Intelligence 171(2-3):73-106.
den Broeck, G. V.; Taghipour, N.; Meert, W.; Davis, J.; and Raedt, L. D. 2011. Lifted probabilistic inference by first-order knowledge compilation. In IJCAI.
Deng, L., and Wiebe, J. 2015. Joint prediction for entity/event-level sentiment analysis using probabilistic soft logic models. In EMNLP.
Ebrahimi, J.; Dou, D.; and Lowd, D. 2016. Weakly supervised tweet stance classification by relational bootstrapping. In EMNLP.
Forouzan, S., and Ihler, A. 2013. Linear approximation to ADMM for MAP inference. In JMLR W&CP, 48-61.
Getoor, L., and Taskar, B. 2007. Introduction to Statistical Relational Learning. MIT Press.
Gogate, V., and Domingos, P. M. 2010. Exploiting logical structure in lifted probabilistic inference. In StarAI Workshop at AAAI.
Gridach, M.; Haddad, H.; and Mulki, H. 2017. Churn identification in microblogs using convolutional neural networks with structured logical knowledge. In NUT Workshop at EMNLP.
Grohe, M.; Kersting, K.; Mladenov, M.; and Schweitzer, P. 2017. Color refinement and its applications. In Van den Broeck, G.; Kersting, K.; Natarajan, S.; and Poole, D., eds., An Introduction to Lifted Probabilistic Inference. Cambridge University Press.
Kazemi, S. M., and Poole, D. 2016. Knowledge compilation for lifted probabilistic inference: Compiling to a low-level language. In KR.
Kersting, K.; Ahmadi, B.; and Natarajan, S. 2009. Counting belief propagation. In UAI, 277-284.
Kersting, K. 2012. Lifted probabilistic inference. In ECAI.
Kimmig, A.; Mihalkova, L.; and Getoor, L. 2015. Lifted graphical models: A survey. Machine Learning 99(1):1-45.
Kouki, P.; Pujara, J.; Marcum, C.; Koehly, L. M.; and Getoor, L. 2017. Collective entity resolution in familial networks. In ICDM.
Lalithsena, S.; Perera, S.; Kapanipathi, P.; and Sheth, A. P. 2017. Domain-specific hierarchical subgraph extraction: A recommendation use case. In Big Data.
Mladenov, M.; Ahmadi, B.; and Kersting, K. 2012. Lifted linear programming. In AISTATS.
Mladenov, M.; Heinrich, D.; Kleinhans, L.; Gonsior, F.; and Kersting, K. 2016. Reloop: A Python-embedded declarative language for relational optimization. In DeLBP Workshop at AAAI.
Mladenov, M.; Kersting, K.; and Globerson, A. 2014. Efficient lifting of MAP LP relaxations using k-locality. In AISTATS.
Mladenov, M.; Kleinhans, L.; and Kersting, K. 2017. Lifted inference for convex quadratic programs. In AAAI.
Poole, D. 2003. First-order probabilistic inference. In IJCAI.
Ramana, M. V.; Scheinerman, E. R.; and Ullman, D. 1994. Fractional isomorphism of graphs. Discrete Mathematics.
Richardson, M., and Domingos, P. M. 2006. Markov logic networks. Machine Learning 62(1-2):107-136.
Sen, P.; Deshpande, A.; and Getoor, L. 2009. Bisimulation-based approximate lifted inference. In UAI.
Singla, P., and Domingos, P. M. 2008. Lifted first-order belief propagation. In AAAI.
Sridhar, D.; Fakhraei, S.; and Getoor, L. 2016. A probabilistic approach for collective similarity-based drug-drug interaction prediction. Bioinformatics 32(20):3175-3182.
Wang, W., and Ku, L. 2016. Identifying Chinese lexical inference using probabilistic soft logic. In ASONAM.