# Simple MAP Inference via Low-Rank Relaxations

Roy Frostig, Sida I. Wang, Percy Liang, Christopher D. Manning
Computer Science Department, Stanford University, Stanford, CA, 94305
{rf,sidaw,pliang}@cs.stanford.edu, manning@stanford.edu
Authors contributed equally.

We focus on the problem of maximum a posteriori (MAP) inference in Markov random fields with binary variables and pairwise interactions. For this common subclass of inference tasks, we consider low-rank relaxations that interpolate between the discrete problem and its full-rank semidefinite relaxation. We develop new theoretical bounds on the effect of rank, showing that as the rank grows, the relaxed objective increases but saturates, and that the fraction of objective value retained by the rounded discrete solution decreases. In practice, we present two algorithms for optimizing the low-rank objectives that are simple to implement, enjoy ties to the underlying theory, and outperform existing approaches on benchmark MAP inference tasks.

## 1 Introduction

Maximum a posteriori (MAP) inference in Markov random fields (MRFs) is an important problem with abundant applications in computer vision [1], computational biology [2], natural language processing [3], and elsewhere. To find MAP solutions, stochastic hill-climbing and mean-field inference are widely used in practice due to their speed and simplicity, but they do not admit any formal guarantees of optimality. Message-passing algorithms based on relaxations of the marginal polytope [4] can offer guarantees (with respect to the relaxed objective), but require more complex bookkeeping.

In this paper, we study algorithms based on low-rank SDP relaxations which are both remarkably simple and capable of guaranteeing solution quality. Our focus is on MAP in a restricted but common class of models, namely those over binary variables coupled by pairwise interactions. Here, MAP can be cast as optimizing a quadratic function over the vertices of the $n$-dimensional hypercube: $\max_{x \in \{-1,1\}^n} x^\top A x$. A standard optimization strategy is to relax this integer quadratic program (IQP) to a semidefinite program (SDP), and then round the relaxed solution to a discrete one achieving a constant-factor approximation to the IQP optimum [5, 6, 7]. In practice, the SDP can be solved efficiently using low-rank relaxations [8] of the form $\max_{X \in \mathbb{R}^{n \times k}} \operatorname{tr}(X^\top A X)$.

The first part of this paper is a theoretical study of the effect of the rank $k$ on low-rank relaxations of the IQP. Previous work focused on either using SDPs to solve IQPs [5] or using low-rank relaxations to solve SDPs [8]. We instead consider the direct link between the low-rank problem and the IQP. We show that as $k$ increases, the gap between the relaxed low-rank objective and the SDP shrinks, and vanishes as soon as $k \ge \operatorname{rank}(A)$; our bound adapts to the problem $A$ and can thereby be considerably better than the typical data-independent bound of $O(\sqrt{n})$ [9, 10]. We also show that the rounded objective shrinks in ratio relative to the low-rank objective, but only at a steady rate on the order of $1/k$ on average. This result relies on the connection we establish between the IQP and its low-rank relaxations. In the end, our analysis motivates the use of relatively small values of $k$, which is advantageous from both a solution-quality and an algorithmic-efficiency standpoint.

The second part of this paper explores the use of very low-rank relaxation and randomized rounding (R3) in practice.
We use projected gradient ascent and coordinate-wise ascent to solve the relaxed problem in R3 (Section 4). We note that R3 interfaces with the underlying problem in an extremely simple way, much like Gibbs sampling and mean-field: only a black-box implementation of $x \mapsto Ax$ is required. This decoupling permits users to customize their implementation based on the structure of the weight matrix $A$: using GPUs for dense $A$, lists for sparse $A$, or much faster specialized algorithms for $A$ that are Gaussian filters [11]. In contrast, belief propagation and marginal polytope relaxations [2] need to track messages for each edge or higher-order clique, thereby requiring more memory and a finer-grained interface to the MRF that inhibits flexibility and performance. Finally, we introduce a framework for comparing algorithms via the $x \mapsto Ax$ interface, and use it to compare R3 with annealed Gibbs sampling and mean-field on a range of different MAP inference tasks (Section 5). We find that R3 often achieves the best-scoring results, and we provide some intuition for its advantage in Section 4.1.

## 2 Setup and background

Notation. We write $\mathbb{S}_n$ for the set of symmetric $n \times n$ real matrices and $S_k$ for the unit sphere $\{x \in \mathbb{R}^k : \|x\|_2 = 1\}$. All vectors are columns unless stated otherwise. If $X$ is a matrix, then $X_i \in \mathbb{R}^{1 \times k}$ is its $i$-th row.

This section reviews how MAP inference on binary graphical models with pairwise interactions can be cast as an integer quadratic program (IQP) and approximately solved via semidefinite relaxation and randomized rounding. Let us begin with the definition of an IQP:

Definition 2.1. Let $A \in \mathbb{S}_n$ be a symmetric $n \times n$ matrix. An (indefinite) integer quadratic program (IQP) is the following optimization problem:

$$\max_{x \in \{-1,1\}^n} \ \mathrm{IQP}(x) \overset{\mathrm{def}}{=} x^\top A x \qquad (1)$$

Solving (1) is NP-complete in general: the MAX-CUT problem immediately reduces to it [5]. With an eye towards tractability, consider a first candidate relaxation: $\max_{x \in [-1,1]^n} x^\top A x$. This relaxation is always tight in that the maxima of the relaxed objective and the original objective (1) are equal.¹ It is therefore just as hard to solve. Let us instead replace each scalar $x_i \in [-1,1]$ with a unit vector $X_i \in \mathbb{R}^k$ and define the following low-rank problem (LRP):

Definition 2.2. Let $k \in \{1, \dots, n\}$ and $A \in \mathbb{S}_n$. Define the low-rank problem $\mathrm{LRP}_k$ by:

$$\max_{X \in \mathbb{R}^{n \times k}} \ \mathrm{LRP}_k(X) \overset{\mathrm{def}}{=} \operatorname{tr}(X^\top A X) \quad \text{subject to } \|X_i\|_2 = 1, \ i = 1, \dots, n. \qquad (2)$$

Note that setting $X_i = [x_i, 0, \dots, 0] \in \mathbb{R}^k$ recovers (1). More generally, we have a sequence of successively looser relaxations as $k$ increases. What we get in return is tractability. The $\mathrm{LRP}_k$ objective generally yields a non-convex problem, but if we take $k = n$, the objective can be rewritten as $\operatorname{tr}(X^\top A X) = \operatorname{tr}(A X X^\top) = \operatorname{tr}(A S)$, where $S$ is a positive semidefinite matrix with ones on the diagonal. The result is the classic SDP relaxation, which is convex:

$$\max_{S \in \mathbb{S}_n} \ \mathrm{SDP}(S) \overset{\mathrm{def}}{=} \operatorname{tr}(A S) \quad \text{subject to } S \succeq 0, \ \operatorname{diag}(S) = \mathbf{1} \qquad (3)$$

Although convexity begets easy optimization in a theoretical sense, the number of variables in the SDP is quadratic in $n$. Thus for large SDPs, we actually return to the low-rank parameterization (2). Solving $\mathrm{LRP}_k$ via simple gradient methods works extremely well in practice and is partially justified by the theoretical analyses in [8, 12].

¹ Proof. WLOG $A \succeq 0$, because adding to its diagonal merely adds a constant term to the IQP objective. The objective is then a convex function, as we can factor $A = LL^\top$ and write $x^\top L L^\top x = \|L^\top x\|_2^2$, so it must be maximized over its convex polytope domain at a vertex point.
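To make the $x \mapsto Ax$ interface concrete, here is a minimal sketch (not the authors' implementation) of projected gradient ascent on $\mathrm{LRP}_k$, assuming numpy; the step size `eta`, iteration count, and function name are illustrative placeholders. The gradient of $\operatorname{tr}(X^\top A X)$ is $2AX$, and projection onto the feasible set simply renormalizes each row onto the unit sphere.

```python
import numpy as np

def solve_lrp_k(A, k, steps=500, eta=0.05, seed=0):
    """Projected gradient ascent on LRP_k: maximize tr(X^T A X) subject to
    each row X_i lying on the unit sphere in R^k. A is touched only through
    matrix products, matching the x -> Ax black-box interface."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    X = rng.standard_normal((n, k))
    X /= np.linalg.norm(X, axis=1, keepdims=True)      # random feasible start
    for _ in range(steps):
        X = X + eta * 2.0 * (A @ X)                     # gradient of tr(X^T A X) is 2 A X
        X /= np.linalg.norm(X, axis=1, keepdims=True)   # project rows back onto the sphere
    return X
```

One natural realization of the coordinate-wise ascent mentioned above (a sketch of the idea, not necessarily the exact update used in Section 4) instead updates one row at a time, setting $X_i$ to the normalization of $\sum_{j \neq i} A_{ij} X_j$, which maximizes the objective over $X_i$ with all other rows held fixed.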
To complete the picture, we need to convert a relaxed solution $X \in \mathbb{R}^{n \times k}$ into an integral solution $x \in \{-1,1\}^n$ of the original IQP (1). This can be done as follows: draw a vector $g \in \mathbb{R}^k$ uniformly at random from the unit sphere, project each $X_i$ onto $g$, and take the sign. Formally, we write $x = \mathrm{rrd}(X)$ to mean $x_i = \operatorname{sign}(X_i g)$ for $i = 1, \dots, n$. This randomized rounding procedure was pioneered by [5] to give the celebrated 0.878-approximation of MAX-CUT.

## 3 Understanding the relaxation-rounding tradeoff

The overall IQP strategy is to first relax the integer problem domain, then round back into it. The optimal objective increases under relaxation, but decreases under randomized rounding. How do these effects compound? To guide our choice of relaxation, we analyze the effect that the rank $k$ in (2) has on the approximation ratio of rounded versus optimal IQP solutions. More formally, let $x^\star$, $X^\star$, and $S^\star$ denote global optima of IQP, of $\mathrm{LRP}_k$, and of SDP, respectively. We can decompose the approximation ratio as follows:

$$1 \;\ge\; \underbrace{\frac{\mathbb{E}[\mathrm{IQP}(\mathrm{rrd}(X^\star))]}{\mathrm{IQP}(x^\star)}}_{\text{approximation ratio}} \;\ge\; \underbrace{\frac{\mathrm{IQP}(x^\star)}{\mathrm{SDP}(S^\star)}}_{\text{constant}} \,\cdot\, \underbrace{\frac{\mathrm{LRP}_k(X^\star)}{\mathrm{SDP}(S^\star)}}_{\text{tightening ratio } T(k)} \,\cdot\, \underbrace{\frac{\mathbb{E}[\mathrm{IQP}(\mathrm{rrd}(X^\star))]}{\mathrm{LRP}_k(X^\star)}}_{\text{rounding ratio } R(k)}$$

As $k$ increases from 1, the tightening ratio $T(k)$ increases towards 1 and the rounding ratio $R(k)$ decreases from 1. In this section, we lower bound $T$ and $R$ each in turn, thus lower-bounding the approximation ratio as a function of $k$. Specifically, we show that $T(k)$ reaches 1 at small $k$ and that $R(k)$ falls, but no lower than $2/(\pi\,\gamma(k)) \ge 2/\pi$.

In practice, one cannot find $X^\star$ for general $k$ with guaranteed efficiency (if we could, we would simply use $\mathrm{LRP}_1$ to solve the original IQP directly). However, Section 5 shows empirically that simple procedures solve $\mathrm{LRP}_k$ well even for small $k$.

### 3.1 The tightening ratio T(k) increases

We now show that, under the assumption $A \succeq 0$, the tightening ratio $T(k)$ plateaus early and approaches this plateau steadily. Hence, provided $k$ is beyond this saturation point, and large enough that an $\mathrm{LRP}_k$ solver is practically capable of providing near-optimal solutions, there is no advantage in taking $k$ larger.

First, $T(k)$ is steadily bounded below. The following is a result of [13] (which also gives insight into the theoretical worst-case hardness of optimizing $\mathrm{LRP}_k$):

Theorem 3.1 ([13]). Fix $A \succeq 0$ and let $S^\star$ be an optimal SDP solution. There is a randomized algorithm that, given $S^\star$, outputs $\bar{X}$ feasible for $\mathrm{LRP}_k$ such that $\mathbb{E}_{\bar{X}}[\mathrm{LRP}_k(\bar{X})] \ge \gamma(k)\,\mathrm{SDP}(S^\star)$, where

$$\gamma(k) = \frac{2}{k} \left( \frac{\Gamma((k+1)/2)}{\Gamma(k/2)} \right)^2.$$

For example, $\gamma(1) = 2/\pi \approx 0.6366$, $\gamma(2) \approx 0.7854$, $\gamma(3) \approx 0.8488$, $\gamma(4) \approx 0.8836$, $\gamma(5) \approx 0.9054$.²

By optimality of $X^\star$, $\mathrm{LRP}_k(X^\star) \ge \mathbb{E}_{\bar{X}}[\mathrm{LRP}_k(\bar{X})]$ under any probability distribution, so the existence of the algorithm in Theorem 3.1 implies that $T(k) \ge \gamma(k)$.

Moreover, $T(k)$ achieves its maximum of 1 at small $k$, and hence must strictly exceed the $\gamma(k)$ lower bound early on. We can arrive at this fact by bounding the rank of the SDP-optimal solution $S^\star$. This is because $S^\star$ factors into $S^\star = XX^\top$, where $X \in \mathbb{R}^{n \times \operatorname{rank} S^\star}$ and must be optimal since $\mathrm{LRP}_{\operatorname{rank} S^\star}(X) = \mathrm{SDP}(S^\star)$. Without consideration of $A$, the following theorem uniformly bounds this rank at well below $n$. The theorem was established independently by [9] and [10]:

Theorem 3.2 ([9, 10]). Fix a weight matrix $A$. There exists an optimal solution $S^\star$ to the SDP (3) such that $\operatorname{rank} S^\star \le \sqrt{2n}$.
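As a quick check on the closed form in Theorem 3.1, the $\gamma(k)$ values quoted above are easy to reproduce; a small sketch using Python's standard-library gamma function:

```python
from math import gamma

def gamma_k(k):
    """gamma(k) = (2/k) * (Gamma((k+1)/2) / Gamma(k/2))^2, as in Theorem 3.1."""
    return (2.0 / k) * (gamma((k + 1) / 2) / gamma(k / 2)) ** 2

# Prints [0.6366, 0.7854, 0.8488, 0.8836, 0.9054]
print([round(gamma_k(k), 4) for k in range(1, 6)])
```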
² The function $\gamma(k)$ generalizes the constant approximation factor $2/\pi = \gamma(1)$ with regard to the implications of the unique games conjecture: the authors show that no polynomial-time algorithm can, in general, approximate $\mathrm{LRP}_k$ to a factor greater than $\gamma(k)$, assuming P ≠ NP and the UGC.

Figure 1: Plots of quantities analyzed in Section 3, under $A \in \mathbb{R}^{100 \times 100}$ whose entries are sampled independently from a unit Gaussian. (a) The rounding ratio $R(k)$ (blue) is close to its $2/(\pi\,\gamma(k))$ lower bound (red) across small $k$. (b) The empirical tightening ratio $T(k) = \mathrm{LRP}_k/\mathrm{SDP}$ (blue) clears its lower bound $\gamma(k)$ (red) and hits its ceiling of 1 at $k = 4$. (c) Rounded objective values vs. $k$: optimal SDP value (cyan), best IQP rounding (green), and mean IQP rounding ± one standard deviation (black). For this instance, the empirical post-rounding objectives are shown at the right for completeness.

Hence we know already that the tightening ratio $T(k)$ equals 1 by the time $k$ reaches $\sqrt{2n}$. Taking $A$ into consideration, we can identify a class of problem instances for which $T(k)$ actually saturates at even smaller $k$. This result is especially useful when the rank of the weight matrix $A$ is known, or even under one's control, while modeling the underlying optimization task:

Theorem 3.3. If $A$ is symmetric, there is an optimal SDP solution $S^\star$ such that $\operatorname{rank} S^\star \le \operatorname{rank} A$.

A complete proof is in Appendix A.1. Because adding to the diagonal of $A$ is equivalent to merely adding a constant to the objective of all problems considered, Theorem 3.3 can be strengthened:

Corollary 3.4. For any symmetric weight matrix $A$, there exists an optimal SDP solution $S^\star$ such that $\operatorname{rank} S^\star \le \min_{u \in \mathbb{R}^n} \operatorname{rank}(A + \operatorname{diag}(u))$.

That is, changes to the diagonal of $A$ that reduce its rank may be applied to improve the bound. In summary, $T(k)$ grows at least as fast as $\gamma(k)$, from $T(1) \ge 0.6366$ at $k = 1$ to $T(k) = 1$ at $k = \min\{\sqrt{2n},\ \min_{u \in \mathbb{R}^n} \operatorname{rank}(A + \operatorname{diag}(u))\}$. This is validated empirically in Figure 1b.

### 3.2 The rounding ratio R(k) decreases

As the dimension $k$ of the row vectors $X_i$ in the $\mathrm{LRP}_k$ problem grows, the rounding procedure incurs a larger expected drop in objective value. Fortunately, we can bound this drop. Even more fortunately, the bound grows no faster than $\gamma(k)$, exactly the steady lower bound for $T(k)$. We obtain this result with an argument based on the analysis of [13]:

Theorem 3.5. Fix a weight matrix $A \succeq 0$ and any $\mathrm{LRP}_k$-feasible $X \in \mathbb{R}^{n \times k}$. The rounding ratio for $X$ is bounded below as

$$\frac{\mathbb{E}[\mathrm{IQP}(\mathrm{rrd}(X))]}{\mathrm{LRP}_k(X)} \;\ge\; \frac{2}{\pi\,\gamma(k)}.$$

Note that $X$ in the theorem need not be optimal: the bound applies to whatever solution an $\mathrm{LRP}_k$ solver might provide. The proof, given in Appendix A.1, uses Lemma 1 from [13], which is based on the theory of positive definite functions on spheres [14]. A decrease in $R(k)$ that tracks the lower bound is observed empirically in Figure 1a.

In summary, considering only the steady bounds (Theorems 3.1 and 3.5), $T$ always rises at least as fast as $R$ falls. The added fact that $T$ plateaus early (Theorem 3.2 and Corollary 3.4) means that $T$ in fact rises even faster. In practice, we would like to take $k$ beyond 1, as we find that the first few relaxations give the optimizer an increasing advantage in arriving at a good $\mathrm{LRP}_k$ solution, close to $X^\star$ in objective. The rapid rise of $T$ relative to $R$ just shown then justifies not taking $k$ much larger, if at all.
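To close the loop on the procedure analyzed in this section, the following is a minimal sketch of the randomized rounding step $\mathrm{rrd}$, assuming numpy; the helper names, the number of draws, and the best-of-several-draws selection are illustrative additions (the expectation bounds above apply to a single draw).

```python
import numpy as np

def rrd(X, rng):
    """Randomized rounding: draw g uniformly from the unit sphere in R^k,
    project each row X_i onto g, and take signs."""
    g = rng.standard_normal(X.shape[1])
    g /= np.linalg.norm(g)
    x = np.sign(X @ g)
    x[x == 0] = 1.0          # break (measure-zero) ties toward +1
    return x

def best_of_roundings(A, X, draws=100, seed=0):
    """Round several times and keep the discrete x with the highest IQP objective x^T A x."""
    rng = np.random.default_rng(seed)
    return max((rrd(X, rng) for _ in range(draws)),
               key=lambda x: float(x @ A @ x))
```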
## 4 Pairwise MRFs, optimization, and inference alternatives

Having understood theoretically how the IQP relates to its low-rank relaxations, we now turn to MAP inference and empirical evaluation. We will show that the $\mathrm{LRP}_k$ objective can be optimized via a simple interface to the underlying MRF. This interface then becomes the basis for (a) a MAP inference algorithm based on very low-rank relaxations, and (b) a comparison to two other basic algorithms for MAP: Gibbs sampling and mean-field variational inference.

A binary pairwise Markov random field (MRF) models a function $h$ over $x \in \{0,1\}^n$ given by

$$h(x) = \sum_i \theta_i(x_i) + \sum_{i<j} \theta_{ij}(x_i, x_j).$$