# Coarse-to-Fine Lifted MAP Inference in Computer Vision

Haroun Habeeb and Ankit Anand and Mausam and Parag Singla
Indian Institute of Technology Delhi
haroun7@gmail.com and {ankit.anand,mausam,parags}@cse.iitd.ac.in

## Abstract

There is a vast body of theoretical research on lifted inference in probabilistic graphical models (PGMs). However, few demonstrations exist where lifting is applied in conjunction with top-of-the-line applied algorithms. We pursue the applicability of lifted inference for computer vision (CV), with the insight that a globally optimal (MAP) labeling will likely have the same label for two symmetric pixels. The success of our approach lies in efficiently handling a distinct unary potential on every node (pixel), typical of CV applications. This allows us to lift the large class of algorithms that model a CV problem via PGM inference. We propose a generic template for coarse-to-fine (C2F) inference in CV, which progressively refines an initial coarsely lifted PGM for varying quality-time trade-offs. We demonstrate the performance of C2F inference by developing lifted versions of two near state-of-the-art CV algorithms for stereo vision and interactive image segmentation. We find that, against flat algorithms, the lifted versions have a much superior anytime performance, without any loss in final solution quality.

## 1 Introduction

Lifted inference in probabilistic graphical models (PGMs) refers to the set of techniques that carry out inference over groups of random variables (or states) that behave similarly [Jha et al., 2010; Kimmig et al., 2015]. A vast body of theoretical work develops a variety of lifted inference techniques, both exact (e.g., [Poole, 2003; Braz et al., 2005; Singla and Domingos, 2008; Kersting, 2012]) and approximate (e.g., [Singla et al., 2014; Van den Broeck and Niepert, 2015]). Most of these works develop technical ideas applicable to generic subclasses of PGMs, and the accompanying experiments are aimed at providing first proofs of concept. However, little work exists on transferring these ideas to the top domain-specific algorithms for real-world applications. Algorithms for NLP, computational biology, and computer vision (CV) problems make heavy use of PGM machinery (e.g., [Blei et al., 2003; Friedman, 2004; Szeliski et al., 2008]), but they also include significant problem-specific insights to get high performance. Barring a handful of exceptions [Jernite et al., 2015; Nath and Domingos, 2016], lifted inference hasn't been applied directly to such algorithms.

We study the potential value of lifting for CV problems such as image denoising, stereo vision, and image segmentation. Most CV problems are structured output prediction tasks, typically assigning a label to each pixel. A large class of solutions are PGM-based: they define a Markov Random Field (MRF) that has each pixel as a node, with a unary potential that depends on the pixel value, and pairwise neighborhood potentials that favor similar labels for neighboring pixels.

We see three main challenges in applying the existing lifted inference literature to these problems. First, most existing algorithms focus on computing marginals [Singla and Domingos, 2008; Kersting et al., 2009; Gogate and Domingos, 2011; Niepert, 2012; Anand et al., 2016; 2017] instead of MAP inference.
Second, among the algorithms performing lifted MAP [Noessner et al., 2013; Mladenov et al., 2014; Sarkhel et al., 2014; Mittal et al., 2014], many focus on exact lifting. Exact lifting misses the kind of symmetries we need to compute, since different pixels may not have exactly the same neighborhood. Third, the few algorithms that perform approximate lifting for MAP, e.g., [Sarkhel et al., 2015], can't handle a distinct unary potential on every node. This is essential for our application, since image pixels take ordinal values in three channels.

In response, we develop an approximate lifted MAP inference algorithm which can effectively handle unary potentials. We initialize our algorithm by merging together pixels having the same order of top-k labels based on the unary potential values. We then adapt an existing symmetry-finding algorithm [Kersting et al., 2009] to discover groupings which also have similar neighborhoods. We refer to our groupings as lifted pixels. We impose the constraint that all pixels in a lifted pixel must be assigned the same label. Our approximate lifting reduces the model size drastically, leading to significant time savings.

Unfortunately, such approximate lifting could adversely impact solution quality. However, we vary the degree of approximation in symmetry finding to output a sequence of coarse-to-fine models with varying quality-time trade-offs. By switching between such models, we develop a coarse-to-fine (C2F) inference procedure applicable to many CV problems. We formalize these ideas in a novel template for using lifted inference in CV.

We test C2F lifted inference on two problems: stereo matching and image segmentation. We start with one of the best MRF-based solvers for each problem; neither is a vanilla MRF solver. Mozerov and van de Weijer [2015] use a two-way energy minimization to effectively handle occluded regions in stereo matching. Cooperative cuts [Kohli et al., 2013] for image segmentation use concave functions over a predefined set of pixel pairs to correctly segment images with sharp edges. We implement C2F inference on top of both these algorithms and find that the C2F versions have a strong anytime behavior: given any amount of inference time, they output a much higher quality solution (and are never worse) than their unlifted counterparts, and don't suffer any loss in final quality.

Overall, our contributions are:

1. We present an approximate lifted MAP algorithm that can efficiently handle a large number of distinct unary potentials.
2. We develop a novel template for applying lifted inference in structured prediction tasks in CV. We provide methods that output progressively finer approximate symmetries, leading to a C2F lifted inference procedure.
3. We implement C2F inference over a near state-of-the-art stereo matching algorithm, and one of the best MRF-based image segmentation algorithms. We release our implementation for wider use by the community (https://github.com/dair-iitd/c2fi4cv/).
4. We find that C2F has a much superior anytime behavior. For stereo matching it achieves 60% better quality on average in time-constrained settings. For image segmentation C2F reaches convergence in 33% less time.

## 2 Background

### 2.1 Computer Vision Problems as MRFs

Most computer vision problems are structured output prediction problems, and their PGM-based solutions often follow similar formulations.
They cast the tasks as the problem of finding the lowest-energy assignment over grid-structured MRFs (denoted by G = (X, γ)). The random variables in these MRFs are the set of pixels X in the input image. Given a set of labels L = {1, 2, . . . , |L|}, the task of structured output prediction is to label each pixel X with a label from L. The MRFs have two kinds of potentials (γ): unary and higher-order. Unary potentials are defined over each individual pixel, and usually incorporate pixel intensity, color, and other pixel features. Higher-order potentials operate over cliques (pairs or more) of neighboring pixels and typically express some form of spatial homophily: neighboring pixels are more likely to have similar labels. While the general PGM structure of various tasks is similar, the specific potential tables and label spaces are task-dependent.

The goal is to find the MAP assignment over this MRF, which is equivalent to energy minimization (by defining energy as the negative log of the potentials). We denote the negative log of unary potentials by φ, and that of higher-order potentials by ψ. (In the interest of readability, we say "potential" to mean the negative log of a potential in the rest of the paper.) Thus, the energy of a complete assignment x ∈ L^|X| can be defined as:

$$E(\mathbf{x}) = \sum_{i \in 1..|X|} \phi(x_i) + \sum_{j} \psi_j(\hat{\mathbf{x}}_j) \qquad (1)$$

Here x̂_j denotes the assignment x restricted to the set of variables in the potential ψ_j. The output of the algorithm is the assignment x_MAP:

$$\mathbf{x}_{MAP} = \arg\min_{\mathbf{x} \in L^{|X|}} E(\mathbf{x}) \qquad (2)$$

The problem is in general intractable. Efficient approximations exploit special characteristics of potentials like submodularity [Jegelka and Bilmes, 2011], or use variants of graph cut or loopy belief propagation [Boykov et al., 2001; Freeman et al., 2000].

### 2.2 Symmetries in Graphical Models

Lifting an algorithm often requires computing a set of symmetries that can be exploited by that algorithm. For PGMs, two popular methods for symmetry computation are color passing for computing symmetries of variables [Kersting et al., 2009], and graph isomorphism for symmetries of states [Niepert, 2012; Bui et al., 2013]. Since our work is based on color passing, we explain it in more detail.

Color passing for an MRF operates over a colored bipartite graph containing nodes for all variables and potentials, and each node is assigned a color. The graph is initialized as follows: all variable nodes get a common color; all potential nodes with exactly the same potential tables are assigned a unique color. Then, in an iterative color passing scheme, in each iteration, each variable node passes its color to all neighboring potential nodes. The potential nodes store the incoming color signatures in a vector, append their own color to it, and send the vector back to the variable nodes. The variable nodes stack these incoming vectors. New colors are assigned to each node based on the set of incoming messages, such that two nodes with the same messages are assigned the same unique color. This process is repeated until convergence, i.e., until there is no further change in colors. A coloring of the bipartite graph defines a partition of variable nodes such that all nodes of the same color form a partition element. Each iteration of color passing creates successively finer partitions, since two variable nodes, once assigned different colors, can never get the same color.
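To make the scheme concrete, here is a minimal Python sketch of one color passing iteration, assuming the MRF is given as a map from potential ids to the ordered list of variable ids in each potential's scope; the helper names are ours, not from any released implementation, and for simplicity a variable's signature ignores its position inside each potential's scope:

```python
from collections import defaultdict

def recolor(signatures):
    """Assign one fresh color per distinct signature."""
    table = {}
    return {n: table.setdefault(sig, len(table)) for n, sig in signatures.items()}

def color_passing_iteration(var_colors, pot_colors, scope):
    """One iteration of color passing. `scope` maps each potential id
    to the ordered list of variable ids it touches. Returns the
    refined variable and potential colorings."""
    # Potentials collect their variables' colors and append their own color.
    pot_sigs = {p: tuple(var_colors[v] for v in vs) + (pot_colors[p],)
                for p, vs in scope.items()}
    new_pot_colors = recolor(pot_sigs)
    # Variables stack the signatures coming back from their potentials.
    var_sigs = defaultdict(list)
    for p, vs in scope.items():
        for v in vs:
            var_sigs[v].append(new_pot_colors[p])
    new_var_colors = recolor({v: tuple(sorted(s)) for v, s in var_sigs.items()})
    return new_var_colors, new_pot_colors
```

Iterating this function until the number of distinct variable colors stops changing yields the converged partition; stopping earlier yields the coarser partitions exploited later in the paper.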
## 3 Lifted Computer Vision Framework

In this section, we describe our generic template, which can be used to lift a large class of vision applications, including those in stereo, segmentation, etc. Our template can be seen as transforming the original problem space into a reduced problem space over which the original inference algorithm can be applied much more efficiently. The description in this section is entirely algorithm independent. We focus on MAP inference, which is the inference task of choice for most vision applications (refer to Section 2).

The key insight in our formulation is the realization that pixels which are involved in the same (or similar) kinds of unary and higher-order potentials, and have the same (or similar) neighborhoods, are likely to have the same MAP value. Therefore, if we could somehow discover such sets of pixels a priori, we could explicitly enforce these pixels to have the same value while searching for the solution, substantially reducing the problem size while still preserving the optimal MAP assignment(s). Since in general doing this exactly may lead to a degenerate network, we do it approximately, trading off a marginal loss in solution quality for speed. The loss in solution quality is offset by resorting to coarse-to-fine inference, where we start with a crude approximation and gradually make it finer, guaranteeing optimality at the end while still obtaining significant gains. We next describe the details of our approach.

### 3.1 Obtaining a Reduced Problem

Consider an energy minimization problem over a PGM G = (X, γ). Let L = {1, 2, . . . , |L|} denote the set of labels over which variables in the set X can vary. Let Y^P = {Y^P_1, Y^P_2, . . . , Y^P_r} denote a partition of X into r disjoint subsets, i.e., ∀k, Y^P_k ⊆ X; Y^P_{k1} ∩ Y^P_{k2} = ∅ when k1 ≠ k2; and ∪_k Y^P_k = X. We refer to each Y^P_k as a partition element. Correspondingly, let us define Y = {Y_1, Y_2, . . . , Y_r} as a set of partition variables, where there is a one-to-one correspondence between partition elements and partition variables, and each partition variable Y_k takes values in the set L. Let part(X_i) denote the partition element to which X_i belongs.

Let X̂_j ⊆ X denote a subset of variables. We say that a partition element Y^P_k is represented in the set X̂_j if ∃X_i ∈ X̂_j s.t. part(X_i) = Y^P_k. Given a subset of variables X̂_j, let γ_j(X̂_j) be a potential defined over X̂_j. Let x̂_j denote an assignment to the variables in the set X̂_j, and let x̂_j.elem(i) denote the value taken by a variable X_i ∈ X̂_j. We say that an assignment X̂_j = x̂_j respects a partition Y^P if the variables in X̂_j belonging to the same partition element have the same label in x̂_j, i.e., part(X_i) = part(X_i′) ⟹ x̂_j.elem(i) = x̂_j.elem(i′), ∀X_i, X_i′ ∈ X̂_j.

Next, we introduce the notion of a reduced potential.

Definition 3.1. Let X be a set of variables and let Y^P denote its partition. Given the potential γ_j(X̂_j), the reduced potential Γ_j is defined to be the restriction of γ_j(X̂_j) to those labeling assignments of X̂_j which respect the partition Y^P. Equivalently, we can define the reduced potential Γ_j(Ŷ_j) over the set of partition variables Ŷ_j which are represented in the set X̂_j.

For example, consider a potential γ(X_1, X_2, X_3) defined over three Boolean variables. The table for γ would have 8 entries. Consider the partition Y^P = {Y^P_1, Y^P_2} where Y^P_1 = {X_1, X_2} and Y^P_2 = {X_3}. Then, the reduced potential Γ is the restriction of γ to those rows in the table where X_1 = X_2. Hence, Γ has four rows in its table and can equivalently be thought of as defining a potential over the 4 possible combinations of the Y_1 and Y_2 variables.
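The restriction of Definition 3.1 is mechanical; here is a sketch in Python, under the assumption that a potential is stored as a dictionary from assignment tuples to energies (our encoding, not the paper's):

```python
import itertools

def reduced_potential(potential, variables, part):
    """Restrict `potential` (dict: assignment tuple -> energy) to the
    rows that respect the partition `part` (dict: variable -> element
    id), re-indexed over the represented partition variables
    (Definition 3.1). Hypothetical helper, not the authors' code."""
    elems = sorted({part[v] for v in variables})      # represented elements
    reduced = {}
    for assign, energy in potential.items():
        labels = dict(zip(variables, assign))
        # keep the row only if co-partitioned variables agree
        if all(labels[u] == labels[v]
               for u in variables for v in variables if part[u] == part[v]):
            key = tuple(labels[next(v for v in variables if part[v] == e)]
                        for e in elems)
            reduced[key] = energy
    return reduced

# The example above: gamma over three Boolean variables, with the
# partition Y1 = {X1, X2}, Y2 = {X3}; 4 of the 8 rows survive.
gamma = {a: float(sum(a)) for a in itertools.product([0, 1], repeat=3)}
print(reduced_potential(gamma, ["X1", "X2", "X3"],
                        {"X1": 0, "X2": 0, "X3": 1}))
```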
We are now ready to define a reduced graphical model.

Definition 3.2. Let G = (X, γ) represent a PGM. Given a partition Y^P of X, the reduced graphical model Ḡ = (Y, Γ) is the graphical model defined over the set of partition variables Y such that every potential γ_j ∈ γ in G is replaced by the corresponding reduced potential Γ_j ∈ Γ in Ḡ.

Let E(x) and Ē(y) denote the energies of the states x and y in G and Ḡ, respectively. The following theorem relates the energies of the states in the two graphical models.

Theorem 3.1. For every assignment y of Y in Ḡ, there is a corresponding assignment x of X such that Ē(y) = E(x).

The theorem can be proved by noting that each potential Γ_j(Ŷ_j) in Ḡ was obtained by restricting the original potential γ_j(X̂_j) to those assignments where variables in X̂_j belonging to the same partition element take the same label. Since this correspondence holds for every potential in the reduced set, to obtain the desired state x, for every variable X_i ∈ X we simply assign it the label of its partition variable in y.

Corollary 3.1. Let x_MAP and y_MAP be the MAP states (i.e., the states having the minimum energy) for G and Ḡ, respectively. Then, Ē(y_MAP) ≥ E(x_MAP).

The process of reduction can be seen as curtailing the entire search space to those assignments where variables in the same partition element take the same label. A reduction in the problem space will lead to computational gains but might result in a loss of solution quality, where the solution quality can be captured by the difference between Ē(y_MAP) and E(x_MAP). Therefore, we need to trade off between the two. Intuitively, a good problem reduction will keep in the same partition element those variables which are likely to have the same value in the optimal assignment for the original problem. How do we find such variables without actually solving the inference task? We describe one such technique in Section 3.3.

There is another perspective. Instead of solving one reduced problem, we can work with a series of reduced problems which successively get closer to the optimal solution. The initial reductions are coarser and far from optimal, but can be solved efficiently to quickly reach the region where the solution lies. Successive iterations can then refine the solution, getting iteratively closer to the optimum. This leads us to the coarse-to-fine inference described next.

### 3.2 Coarse-to-Fine Inference

We now define a framework for C2F (coarse-to-fine) inference that maintains the computational advantage while still preserving optimality. In the following, for ease of notation, we drop the superscript P in Y^P when denoting a partition of X. Therefore, Y will refer both to the partition and to the set of partition variables. Before we describe our algorithm, let us start with some definitions.

Definition 3.3. Let Y and Y′ be two partitions of X. We say that Y is coarser than Y′, denoted Y ⪯ Y′, if ∀y′ ∈ Y′, ∃y ∈ Y such that y′ ⊆ y. We equivalently say that Y′ is finer than Y.

It is easy to see that X defines a partition of itself which is the finest among all partitions, i.e., for every partition Y of X, Y ⪯ X. We also refer to it as the degenerate partition.
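Definition 3.3 can be checked directly; a two-line sketch, with partitions encoded as lists of Python sets (our choice of representation):

```python
def is_coarser(Y, Y_prime):
    """Y ⪯ Y′ iff every element of the finer partition Y′ is
    contained in some element of the coarser partition Y."""
    return all(any(yp <= y for y in Y) for yp in Y_prime)

# The degenerate partition [{x} for x in X] is finer than any other:
# is_coarser(Y, [{x} for x in X]) holds for every partition Y of X.
```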
For ease of notation, we denote the finest partition by Y^T (the same as X), and refer to the corresponding PGM as G^T (the same as G). Next, we state a lemma which relates two partitions to each other.

Lemma 1. Let Y and Y′ be two partitions of X such that Y ⪯ Y′. Then Y can be seen as a partition of the set Y′.

The proof of this lemma is straightforward and is omitted due to lack of space.

Consider a set Y of coarse-to-fine partitions, given as Y^0 ⪯ Y^1 ⪯ · · · ⪯ Y^t ⪯ · · · ⪯ Y^T. Let G^t, E^t, and y^t_MAP respectively denote the reduced problem, the energy function, and the MAP assignment for the partition Y^t. Using Lemma 1, Y^t can be seen as a partition of Y^{t+1}, so G^t is a reduction of G^{t+1}. Then, using Theorem 3.1, for every assignment y^t to the variables in Y^t, there is an assignment y^{t+1} to the variables in Y^{t+1} such that E^t(y^t) = E^{t+1}(y^{t+1}). Also, using Corollary 3.1, we have ∀t, E^t(y^t_MAP) ≥ E^{t+1}(y^{t+1}_MAP). Together, these two statements imply that, starting from the coarsest partition, we can gradually keep improving the solution as we move to finer partitions.

Our C2F setup assumes an iterative MAP inference algorithm A which has the anytime property, i.e., it can produce solutions of increasing quality with time. The C2F function (see Algorithm 1) takes 3 inputs: a set of C2F partitions Y, an inference algorithm A, and a stopping criterion C. The algorithm A in turn takes three inputs: a PGM G^t, a starting assignment y^t, and the stopping criterion C. A outputs an approximation to the MAP solution once the stopping criterion C is met. Starting with the coarsest partition (t = 0 in line 2), a start state is picked for the coarsest problem to be solved (line 3). In each iteration (line 4), C2F finds the MAP estimate for the current problem (G^t) using algorithm A (line 5). This solution is then mapped to a same-energy solution of the next finer partition (line 6), which becomes the starting state for the next run of A. The solution is thus successively refined in each iteration. The process is repeated until we reach the finest level of partition. In the end, A is run on the finest partition and the resulting solution is output (lines 9, 10). Since the last partition in the set is the original problem G^T, optimality with respect to A is guaranteed.

Algorithm 1: Coarse-to-Fine Lifted MAP

```
1:  C2F_Lifted_MAP(C2F partitions Y, algorithm A, criterion C)
2:    t = 0; T = |Y|
3:    y^t = get_Init_State(G^t)
4:    while (t < T)
5:      y^t_MAP = A(G^t, y^t, C)
6:      y^{t+1} = get_Equiv_Assignment(Y^t, Y^{t+1}, y^t_MAP)
7:      t = t + 1
8:    end while
9:    y^T_MAP = A(G^T, y^T, C)
10:   return y^T_MAP
```
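Algorithm 1 is a thin driver around A; the following runnable Python rendering makes the control flow explicit, under assumed interfaces (`models`, `get_init_state`, `A`, and `refine` are hypothetical hooks, not the authors' released API):

```python
def c2f_lifted_map(models, A, C, get_init_state, refine):
    """Algorithm 1: `models[t]` is the reduced PGM G^t for partition
    Y^t (the last one being the original model), `A(G, y, C)` is an
    anytime MAP solver run from start state y until criterion C fires,
    and `refine(t, y)` maps a level-t solution to the equal-energy
    start state at level t+1 (Theorem 3.1)."""
    T = len(models) - 1
    y = get_init_state(models[0])
    for t in range(T):
        y_map = A(models[t], y, C)   # solve the coarse model (line 5)
        y = refine(t, y_map)         # warm start for the finer model (line 6)
    return A(models[T], y, C)        # final run on the original model (line 9)
```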
Next, we describe how to use the color passing algorithm (Section 2) to get a series of partitions which get successively finer. Our C2F algorithm can then be applied on this set of partitions to get anytime solutions of high quality while being computationally efficient.

### 3.3 C2F Partitioning for Computer Vision

We now adapt the general color passing algorithm to MRFs for CV problems. Unfortunately, unary potentials make color passing highly ineffective. Different pixels have different RGB values and intensities, leading to almost every pixel getting a different unary potential. A naive application of color passing splits almost all variables into their own partitions, and lifting offers little value. A natural approximation is to define a threshold, such that two unary potentials within that threshold are initialized with the same color. Our experiments show limited success with this scheme, because two pixels may have the same label even when their actual unary potentials are very different. What matters more is the relative importance given to each label, rather than the actual potential values.

In response, we adapt color passing for CV by initializing it as before, but with one key change: we initialize two unary potential nodes with the same color if their lowest-energy labels have the same order for the top N_L labels (we call N_L the unary split threshold). Experiments reveal that this approximation leads to effective partitions for lifted inference; a code sketch of this initialization appears at the end of this section.

Finally, we can easily construct a sequence of coarse-to-fine partitions in the natural course of color passing's execution: every iteration of color passing creates a finer partition. Moreover, as an alternative approach, we may also increase N_L. In our implementations, we intersperse the two, i.e., before every next step we pick one of two choices: either we run another iteration of color passing, or we increase N_L by one and split each variable partition based on the N_L-th lowest-energy labels of its constituent variables. We write CP(N_L, N_iter) to denote the partition from the current state of color passing, which has been run for N_iter iterations with unary split threshold N_L. It is easy to prove that another iteration of color passing, or splitting by increasing N_L as above, leads to a finer partition, i.e., CP(N_L, N_iter) ⪯ CP(N_L + 1, N_iter) and CP(N_L, N_iter) ⪯ CP(N_L, N_iter + 1). We refer to each element of a partition of variables as a lifted pixel, since it is a subset of pixels.
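The unary split initialization just described can be sketched as follows, assuming the unary potentials are stored as a NumPy array of negative log potentials with one row per pixel (the function name is ours):

```python
import numpy as np

def unary_split_colors(unaries, NL):
    """Initial coloring of unary potential nodes: two pixels get the
    same color iff their NL lowest-energy labels appear in the same
    order (the unary split threshold of Section 3.3). `unaries` is a
    (num_pixels, num_labels) array; a sketch, independent of the
    released implementation."""
    order = np.argsort(unaries, axis=1)[:, :NL]   # top-NL labels per pixel
    table = {}
    colors = np.empty(len(unaries), dtype=int)
    for i, sig in enumerate(map(tuple, order)):
        colors[i] = table.setdefault(sig, len(table))
    return colors
```

Increasing NL and re-running this coloring splits each existing partition element further, which is exactly the refinement step interspersed with color passing iterations above.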
## 4 Lifted Inference for Stereo Matching

We first demonstrate the value of lifted inference in the context of stereo matching [Scharstein and Szeliski, 2002]. Stereo matching aims to find pixel correspondences in a set of images of the same scene, which can be used to further estimate the 3D scene. Formally, two images I_l and I_r, corresponding to images of the scene from a left camera and a right camera, are taken such that both cameras are at the same horizontal level. The goal is to compute a disparity labeling D_l for every pixel X = (a, b) such that I_l[a][b] corresponds to I_r[a − D_l[a][b]][b]. We build a lifted version of TSGO [Mozerov and van de Weijer, 2015], as it is MRF-based and ranks 2nd on the Middlebury Stereo Evaluation Version 2 leaderboard (http://vision.middlebury.edu/stereo/eval/).

Background on TSGO: TSGO treats stereo matching as a two-step energy minimization, where the first step is on a fully connected MRF with pairwise potentials and the second is on a conventional locally connected MRF. Lack of space precludes a detailed description of the first step. At a high level, TSGO runs one iteration of message passing on the fully connected MRF and computes marginals of each pixel X, which act as unary potentials φ(X) for the MRF of the second step. The pairwise potential ψ used in step two is ψ(X, X′) = w(X, X′)·ϕ(X, X′), where ϕ(X, X′) is a truncated linear function of the difference between the disparity labels of X and X′, and w(X, X′) takes one of three distinct values depending on the color difference between the pixels. The MAP assignment x_MAP gives the lowest-energy assignment of disparities D_l for every pixel of this MRF.
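For concreteness, the step-two pairwise potential just described might look as follows in code; all numeric constants here are illustrative placeholders, not TSGO's tuned values:

```python
def tsgo_pairwise(d1, d2, color_diff, tau=2, weights=(0.3, 1.0, 3.0),
                  t_low=8, t_high=16):
    """Sketch of psi(X, X') = w(X, X') * phi(X, X'): phi is truncated
    linear in the disparity difference, and w takes one of three
    values depending on the color difference of the two pixels.
    Thresholds and weights are assumptions for illustration."""
    phi = min(abs(d1 - d2), tau)      # truncated linear term
    if color_diff < t_low:
        w = weights[2]                # similar colors: strong smoothing
    elif color_diff < t_high:
        w = weights[1]
    else:
        w = weights[0]                # strong color edge: weak smoothing
    return w * phi
```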
Lifted TSGO: Since step two is costlier, we build its lifted version as discussed in the previous section. For color passing, two unary potential nodes are initialized with the same color if their lowest-energy labels exactly match (N_L = 1). Other initializations are consistent with the original color passing for general MRFs. A sequence of coarse-to-fine models is output as per Section 3.3. C2F TSGO uses outputs from the sequence CP(1, 1), CP(2, 1), CP(3, 1), and then refines to the original MRF. Model refinement is triggered whenever the energy hasn't decreased in the last four iterations of alpha expansion (this becomes the stopping criterion C in Algorithm 1).

Experiments: Our experiments build on top of the existing TSGO implementation (http://www.cvc.uab.es/~mozerov/Stereo/), but we change the minimization algorithm in step two to alpha expansion fusion [Lempitsky et al., 2010] from the OpenGM2 library [Andres et al., 2010; Kappes et al., 2015], as it improves the speed of the base implementation. We use the benchmark Middlebury Stereo datasets of 2003, 2005 and 2006 [Scharstein and Szeliski, 2003; Hirschmuller and Scharstein, 2007]. For the 2003 dataset, quarter-size images are used, and for the others, third-size images are used. The label space is of size 85 (85 distinct disparity labels). We compare our coarse-to-fine TSGO (using CP(N_L, N_iter) partitions) against vanilla TSGO.

Figure 1: (a) Average (normalized) energy vs. inference time. (b) Average pixel error vs. time. C2F TSGO achieves roughly a 60% reduction in time for reaching the optimum, and has the best anytime performance compared to vanilla TSGO and static lifted versions. (c) Average (normalized) energy vs. time for different thresholding values and CP partitions. Plots with the same marker have MRFs of similar sizes.

Figures 1(a,b) show the aggregate plots of energy (and error) vs. time. We observe that C2F TSGO reaches the same optimum as TSGO, but in less than half the time. It has a much superior anytime performance: if inference time is given as a deadline, C2F TSGO obtains 59.59% less error on average over randomly sampled deadlines. We also eyeball the outputs of C2F TSGO and TSGO and find them to be visually similar.

Figure 2: Qualitative results for the Doll image at convergence. C2F TSGO is similar to base TSGO. (a) Left and right images. (b) Ground truth. (c) Disparity map by TSGO. (d) Disparity map by C2F TSGO. (e) Each colored region (other than black) is one among the 10 largest partition elements from CP(1,1); each color represents one partition element. Partition elements form non-contiguous regions.

Figure 2 shows a sample qualitative comparison. Figure 2(e) shows five of the ten largest partition elements in the partition from CP(1, 1). Clearly, the partition elements formed are not contiguous, and seem to capture variables that are likely to get the same assignment. This underscores the value of our lifting framework for CV problems.

We also compare our CP(N_L, N_iter) partitioning strategy with the threshold partitioning discussed in Section 3.3. In the thresholding scheme, we merge two pixels if the L1-norm distance between their unary potentials is less than a threshold. For each partition induced by our approach, we find a value of the threshold that yields roughly the same number of lifted pixels. Figure 1(c) shows that the partitions based on CP(1, 1) and CP(3, 1) quickly converge to a much lower energy compared to the corresponding threshold values (Thr = 50 and Thr = 1, respectively). For CP(2, 1), convergence is slower compared to the corresponding threshold (Thr = 5), but eventually CP(2, 1) reaches significantly better quality.

## 5 Lifted Inference for Image Segmentation

We now demonstrate the general nature of our lifted CV framework by applying it to a second task. We choose multi-label interactive image segmentation, where the goal is to segment an image I based on a seed labeling (true labels for a few pixels) provided as input. Like many other CV problems, this also has an MRF-based solution, with the best label assignment generally obtained by MAP inference using graph cuts or loopy belief propagation [Boykov et al., 2001; Szeliski et al., 2008]. However, MRFs with only pairwise potentials are known to suffer from a short-boundary bias: they prefer segmentations with shorter boundaries, because pairwise potentials penalize every pair of boundary pixels. This leads to incorrect labelings for objects with sharp edges. Kohli et al. [2013] use CoGC, cooperative graph cuts [Jegelka and Bilmes, 2011], to develop one of the best MRF-based solvers that overcome this bias.

Background on CoGC: Traditional MRFs linearly penalize the number of label discontinuities at edges (boundary pixel pairs), but CoGC penalizes the number of types of label discontinuities through the use of a concave energy function over groups of ordered edges. It first clusters all edges on the basis of color differences, and then applies a concave function separately over the number of times a specific discontinuity type is present in each edge group g ∈ G. Their carefully engineered CoGC energy function is as follows:

$$E(\mathbf{x}) = \sum_{i=1}^{|X|} \phi_i(x_i) + \sum_{g \in G} \sum_{l \neq l'} F\Big(\sum_{(x,x') \in g} w(x, x')\,\mathbb{I}(x = l,\, x' = l')\Big)$$

where the unary potentials φ depend on the colors of the seed pixels, F is a concave function, I is the indicator function, and w(x, x′) depends on the color difference between x and x′. Intuitively, F collects all edges with similar discontinuities and penalizes them sub-linearly, thus reducing the short-boundary bias in the model. The usage of a concave function makes the MRF higher-order, with cliques over edge groups. However, the model is shown to reduce to a pairwise hierarchical MRF through the addition of auxiliary variables.
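To make the cooperative energy concrete, here is a sketch that evaluates it for a given labeling; the representation of edge groups and the choice F = sqrt are our assumptions for illustration, not the authors' exact function:

```python
import math
from collections import defaultdict

def coop_energy(labels, unary, edge_groups, w, F=math.sqrt):
    """Evaluate the cooperative-cut style energy above: unaries plus a
    concave F applied, per edge group and per ordered label pair, to
    the total weight of that discontinuity type. `edge_groups` is a
    list of lists of pixel pairs; `w(p, q)` returns the edge weight."""
    E = sum(unary[p][labels[p]] for p in labels)
    for g in edge_groups:
        disc = defaultdict(float)   # (label, label') -> accumulated weight
        for p, q in g:
            if labels[p] != labels[q]:
                disc[(labels[p], labels[q])] += w(p, q)
        E += sum(F(v) for v in disc.values())   # sub-linear group penalty
    return E
```

Because F grows sub-linearly, many cut edges of the same discontinuity type within one group cost little more than a few, which is what counteracts the short-boundary bias.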
Lifted CoGC: CoGC is lifted using the framework of Section 3, with one additional change: we cluster edge groups using the color difference and the position of the edge. Edge groups formed only on the basis of color difference make the error of grouping different segments' boundaries into a single group; for example, in the top image of Figure 3, they erroneously cluster the boundary between the white cow and the grass together with the boundary between the sky and the grass. Coarse-to-fine partitions are obtained by the method described in Section 3.3. C2F CoGC uses outputs from the sequence CP(|L|/2, 2), CP(|L|/2, 3) before refining to the original MRF. Model refinement is triggered if the energy has not reduced over the last |L| iterations.

Figure 3: (a-c) Qualitative results for segmentation; C2F CoGC has quality similar to the CoGC algorithm. (a) Original image. (b) Segmentation by CoGC. (c) Segmentation by C2F CoGC. (d) C2F CoGC has lower energy compared to CoGC and the other lifted variant at all times.

Experiments: Our experiments use the implementation of cooperative graph cuts provided by Kohli et al. [2013] (https://github.com/aosokin/coopCutsCVPR2013). Energy minimization is performed using alpha expansion [Boykov et al., 2001]. The implementation of CoGC performs a greedy descent on the auxiliary variables while performing alpha expansion on the remaining variables, as described in Kohli et al. [2013]. The dataset used is provided with the implementation; it is a part of the MSRC V2 dataset (https://www.microsoft.com/en-us/research/project/image-understanding/?from=http%3A%2F%2Fresearch.microsoft.com%2Fvision%2Fcambridge%2Frecognition%2F).

Figure 3 shows three individual energy vs. time plots. Results on other images are similar. We find that the C2F CoGC algorithm converges to the same energy as CoGC in about two-thirds the time on average. Overall, C2F CoGC achieves a much better anytime performance than the other lifted and unlifted variants of CoGC. Similar to Section 4, refined partitions attain better quality than coarser ones at the expense of time. Since the implementation performs a greedy descent over the auxiliary variables, refinement of the current partition also resets the auxiliary variables to the last value that produced a change. Notice that energy minimization on the output of CP(2, 3) attains a lower energy than on CP(3, 2). This observation drives our decision to refine by increasing N_iter. Qualitatively, C2F CoGC produces the same labeling as CoGC. Finally, similar to stereo matching, partitions based on the thresholding scheme perform significantly worse than CP(N_L, N_iter) for image segmentation as well.

## 6 Related Work

There is a large body of work on exact lifting, both marginal [Kersting et al., 2009; Gogate and Domingos, 2011; Niepert, 2012; Mittal et al., 2015] and MAP [Kersting et al., 2009; Gogate and Domingos, 2011; Niepert, 2012; Sarkhel et al., 2014; Mittal et al., 2014], which is not directly applicable to our setting. There is some recent work on approximate lifting [Van den Broeck and Darwiche, 2013; Venugopal and Gogate, 2014; Singla et al., 2014; Sarkhel et al., 2015; Van den Broeck and Niepert, 2015], but its focus is on marginal inference, whereas we are interested in lifted MAP. Further, this work can't handle a distinct unary potential on every node. An exception is the work by Bui et al. [2012], which explicitly deals with lifting in the presence of distinct unary potentials. Unfortunately, they make a very strong assumption of exchangeability in the absence of unaries, which does not hold in our setting, since each pixel has its own unique neighborhood.

The work by Sarkhel et al. [2015] is probably the closest to ours. They design a C2F hierarchy to cluster constants for approximate lifted MAP inference in Markov logic. In contrast, we partition ground atoms in a PGM. Like other work on approximate lifting, they can't handle distinct unary potentials. Furthermore, they assume that their theory is provided in a normal form, i.e., without evidence, which can be a severe restriction for most practical applications. Kiddon and Domingos [2011] also propose C2F inference for an underlying Markov logic theory. They use a hierarchy of partitions based on a pre-specified ontology. CV does not have any such ontology available, and needs to discover partitions using the PGM directly. Nath and Domingos [2010] exploit (approximate) lifted inference for video segmentation. They experiment on a specific video problem (different from ours), and they only compare against vanilla BP. Their initial partitioning scheme is similar to our thresholding approach, which does not work well in our experiments.

In computer vision, a popular approach to reduce the complexity of inference is to use superpixels [Achanta et al., 2012; Van den Bergh et al., 2012]. Superpixels are obtained by merging neighboring nodes that have similar characteristics.
All pixel nodes in the same superpixel are assigned the same value during MAP inference. SLIC [Achanta et al., 2012] is one of the most popular algorithms for discovering superpixels. Our approach differs from SLIC in some significant ways. First, their superpixels are local in nature, whereas our algorithm can merge pixels that are far apart. This can help in merging two disconnected regions of the same object into a single lifted pixel. Second, they obtain superpixels independently of the inference algorithm, whereas we tightly integrate our lifting with the underlying inference algorithm. This can potentially lead to the discovery of better partitions; indeed, this helped us tremendously in image segmentation. Third, they do not provide a C2F version of their algorithm, and we did not find it straightforward to extend their approach to discover successively finer partitions. There is some recent work [Wei et al., 2016] which addresses the last two of these challenges by introducing a hierarchy of superpixels. In our preliminary experiments, we found that SLIC and the superpixel hierarchy perform worse than our lifting approach. Performing more rigorous comparisons is a direction for future work.

## 7 Conclusion and Future Work

We develop a generic template for applying lifted inference to structured output prediction tasks in computer vision. We show that MRF-based CV algorithms can be lifted at different levels of abstraction, leading to methods for coarse-to-fine inference over a sequence of lifted models. We test our ideas on the two different CV tasks of stereo matching and interactive image segmentation. We find that C2F lifting is vastly more efficient than the unlifted algorithms on both tasks, obtaining a superior anytime performance without any loss in final solution quality. To the best of our knowledge, this is the first demonstration of lifted inference in conjunction with top-of-the-line task-specific algorithms. Although we restrict ourselves to CV in this work, we believe that our ideas are general and can be adapted to other domains such as NLP and computational biology. We plan to explore this in the future.

## Acknowledgements

We thank the anonymous reviewers for their comments and suggestions. Ankit Anand is supported by the TCS Research Scholars Program. Mausam is supported by grants from Google and Bloomberg. Parag Singla is supported by a DARPA grant funded under the Explainable AI (XAI) program. Both Mausam and Parag Singla are supported by Visvesvaraya Young Faculty Fellowships from the Govt. of India. Any opinions, findings, conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views or official policies, either expressed or implied, of the funding agencies.

## References

[Achanta et al., 2012] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. SLIC Superpixels Compared to State-of-the-Art Superpixel Methods. In PAMI, Nov 2012.
[Anand et al., 2016] A. Anand, A. Grover, Mausam, and P. Singla. Contextual Symmetries in Probabilistic Graphical Models. In IJCAI, 2016.
[Anand et al., 2017] A. Anand, R. Noothigattu, P. Singla, and Mausam. Non-Count Symmetries in Boolean & Multi-Valued Prob. Graphical Models. In AISTATS, 2017.
[Andres et al., 2010] B. Andres, J. H. Kappes, U. Köthe, C. Schnörr, and F. A. Hamprecht. An Empirical Comparison of Inference Algorithms for Graphical Models with Higher Order Factors Using OpenGM. In Pattern Recognition, 2010.
[Blei et al., 2003] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. JMLR, 3, March 2003.
[Boykov et al., 2001] Y. Boykov, O. Veksler, and R. Zabih. Fast Approximate Energy Minimization via Graph Cuts. In PAMI, 23(11), November 2001.
[Braz et al., 2005] R. Braz, E. Amir, and D. Roth. Lifted First-Order Probabilistic Inference. In IJCAI, 2005.
[Bui et al., 2012] H. Bui, T. Huynh, and R. de Salvo Braz. Exact Lifted Inference with Distinct Soft Evidence on Every Object. In AAAI, 2012.
[Bui et al., 2013] H. Bui, T. Huynh, and S. Riedel. Automorphism Groups of Graphical Models and Lifted Variational Inference. In UAI, 2013.
[Freeman et al., 2000] W. Freeman, E. Pasztor, and O. Carmichael. Learning Low-Level Vision. In IJCV, 40, 2000.
[Friedman, 2004] N. Friedman. Inferring Cellular Networks using Probabilistic Graphical Models. Science, 303, 2004.
[Gogate and Domingos, 2011] V. Gogate and P. Domingos. Probabilistic Theorem Proving. In UAI, 2011.
[Hirschmuller and Scharstein, 2007] H. Hirschmuller and D. Scharstein. Evaluation of Cost Functions for Stereo Matching. In CVPR, 2007.
[Jegelka and Bilmes, 2011] S. Jegelka and J. Bilmes. Submodularity Beyond Submodular Energies: Coupling Edges in Graph Cuts. In CVPR, 2011.
[Jernite et al., 2015] Y. Jernite, A. Rush, and D. Sontag. A Fast Variational Approach for Learning Markov Random Field Language Models. In ICML, 2015.
[Jha et al., 2010] A. Jha, V. Gogate, A. Meliou, and D. Suciu. Lifted Inference Seen from the Other Side: The Tractable Features. In NIPS, 2010.
[Kappes et al., 2015] J. Kappes, B. Andres, A. Hamprecht, C. Schnörr, S. Nowozin, D. Batra, S. Kim, B. Kausler, T. Kröger, J. Lellmann, N. Komodakis, B. Savchynskyy, and C. Rother. A Comparative Study of Modern Inference Techniques for Structured Discrete Energy Minimization Problems. In IJCV, 2015.
[Kersting et al., 2009] K. Kersting, B. Ahmadi, and S. Natarajan. Counting Belief Propagation. In UAI, 2009.
[Kersting, 2012] K. Kersting. Lifted Probabilistic Inference. In ECAI, 2012.
[Kiddon and Domingos, 2011] C. Kiddon and P. Domingos. Coarse-to-Fine Inference and Learning for First-Order Probabilistic Models. In AAAI, 2011.
[Kimmig et al., 2015] A. Kimmig, L. Mihalkova, and L. Getoor. Lifted Graphical Models: A Survey. Machine Learning, 2015.
[Kohli et al., 2013] P. Kohli, A. Osokin, and S. Jegelka. A Principled Deep Random Field Model for Image Segmentation. In CVPR, 2013.
[Lempitsky et al., 2010] V. Lempitsky, C. Rother, S. Roth, and A. Blake. Fusion Moves for Markov Random Field Optimization. In PAMI, Aug 2010.
[Mittal et al., 2014] H. Mittal, P. Goyal, V. Gogate, and P. Singla. New Rules for Domain Independent Lifted MAP Inference. In NIPS, 2014.
[Mittal et al., 2015] H. Mittal, A. Mahajan, V. Gogate, and P. Singla. Lifted Inference Rules With Constraints. In NIPS, 2015.
[Mladenov et al., 2014] M. Mladenov, K. Kersting, and A. Globerson. Efficient Lifting of MAP LP Relaxations Using k-Locality. In AISTATS, 2014.
[Mozerov and van de Weijer, 2015] M. G. Mozerov and J. van de Weijer. Accurate Stereo Matching by Two-Step Energy Minimization. IEEE Transactions on Image Processing, March 2015.
[Nath and Domingos, 2010] A. Nath and P. Domingos. Efficient Lifting for Online Probabilistic Inference. In AAAIWS, 2010.
[Nath and Domingos, 2016] A. Nath and P. Domingos. Learning Tractable Probabilistic Models for Fault Localization. In AAAI, 2016.
[Niepert, 2012] M. Niepert. Markov Chains on Orbits of Permutation Groups. In UAI, 2012.
[Noessner et al., 2013] J. Noessner, M. Niepert, and H. Stuckenschmidt. RockIt: Exploiting Parallelism and Symmetry for MAP Inference in Statistical Relational Models. In AAAI, 2013.
[Poole, 2003] D. Poole. First-Order Probabilistic Inference. In IJCAI, 2003.
[Sarkhel et al., 2014] S. Sarkhel, D. Venugopal, P. Singla, and V. Gogate. Lifted MAP Inference for Markov Logic Networks. In AISTATS, 2014.
[Sarkhel et al., 2015] S. Sarkhel, P. Singla, and V. Gogate. Fast Lifted MAP Inference via Partitioning. In NIPS, 2015.
[Scharstein and Szeliski, 2002] D. Scharstein and R. Szeliski. A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms. In IJCV, 2002.
[Scharstein and Szeliski, 2003] D. Scharstein and R. Szeliski. High-Accuracy Stereo Depth Maps Using Structured Light. In CVPR, 2003.
[Singla and Domingos, 2008] P. Singla and P. Domingos. Lifted First-Order Belief Propagation. In AAAI, 2008.
[Singla et al., 2014] P. Singla, A. Nath, and P. Domingos. Approximate Lifting Techniques for Belief Propagation. In AAAI, 2014.
[Szeliski et al., 2008] R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M. Tappen, and C. Rother. A Comparative Study of Energy Minimization Methods for Markov Random Fields with Smoothness-Based Priors. In PAMI, June 2008.
[Van den Bergh et al., 2012] M. Van den Bergh, X. Boix, G. Roig, B. de Capitani, and L. Van Gool. SEEDS: Superpixels Extracted via Energy-Driven Sampling. In ECCV, 2012.
[Van den Broeck and Darwiche, 2013] G. Van den Broeck and A. Darwiche. On the Complexity and Approximation of Binary Evidence in Lifted Inference. In NIPS, 2013.
[Van den Broeck and Niepert, 2015] G. Van den Broeck and M. Niepert. Lifted Probabilistic Inference for Asymmetric Graphical Models. In AAAI, 2015.
[Venugopal and Gogate, 2014] D. Venugopal and V. Gogate. Evidence-Based Clustering for Scalable Inference in Markov Logic. In Joint ECML-KDD, 2014.
[Wei et al., 2016] X. Wei, Q. Yang, Y. Gong, M. Yang, and N. Ahuja. Superpixel Hierarchy. CoRR, abs/1605.06325, 2016.