# Proteins, Particles, and Pseudo-Max-Marginals: A Submodular Approach

Jason L. Pacheco (PACHECOJ@CS.BROWN.EDU), Erik B. Sudderth (SUDDERTH@CS.BROWN.EDU)
Department of Computer Science, Brown University, Providence, RI 02912, USA

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

## Abstract

Variants of max-product (MP) belief propagation effectively find modes of many complex graphical models, but are limited to discrete distributions. Diverse particle max-product (D-PMP) robustly approximates max-product updates in continuous MRFs using stochastically sampled particles, but previous work was specialized to tree-structured models. Motivated by the challenging problem of protein side chain prediction, we extend D-PMP in several key ways to create a generic MAP inference algorithm for loopy models. We define a modified diverse particle selection objective that is provably submodular, leading to an efficient greedy algorithm with rigorous optimality guarantees and corresponding max-marginal error bounds. We further incorporate tree-reweighted variants of the MP algorithm to allow provable verification of global MAP recovery in many models. Our general-purpose MATLAB library is applicable to a wide range of pairwise graphical models, and we validate our approach using optical flow benchmarks. We further demonstrate superior side chain prediction accuracy compared to baseline algorithms from the state-of-the-art Rosetta package.

## 1. Introduction

Continuous random variables are often used to model complex interactions among objects in the world around us, leading to challenging multi-modal posterior distributions. The maximum a posteriori (MAP) inference objective for such models is typically non-convex, and optimization algorithms become trapped in local optima. Approaches that discretize the latent space and apply max-product (MP) belief propagation (Pearl, 1988; Wainwright et al., 2005; Wainwright & Jordan, 2008) can be effective in few dimensions, but for high-dimensional models only coarse discretizations are feasible. Continuous optimization can be performed via Monte Carlo sampling and simulated annealing (Geman & Geman, 1984; Andrieu et al., 2003), but these methods often require long computation times.

A number of stochastic local search methods have been developed (Trinh & McAllester, 2009; Peng et al., 2011; Besse et al., 2012; Kothapa et al., 2011) that combine the flexibility of sampling-based approaches with the efficiency of MP message passing. This family of particle max-product (PMP) methods shares a general framework: at each iteration new hypotheses are sampled from stochastic proposals, evaluated via discrete max-product message updates, and accepted or rejected based on some criterion. PMP algorithms differ primarily in their choice of stochastic proposals and particle selection criteria. The diverse particle max-product (D-PMP) algorithm (Pacheco et al., 2014) maintains hypotheses near multiple local optima via an optimization-based selection step that minimizes distortions in MP message values. D-PMP has excellent empirical performance on a human pose estimation task, but there is little theoretical justification for its particle selection integer program (IP), and the proposed greedy algorithm has no optimality guarantees.
Previous D-PMP formulations also assumed a tree-structured Markov random field (MRF), where MP provides exact max-marginals, and several key assumptions would be violated by a naive generalization to loopy graphical models. In this paper, we generalize D-PMP to arbitrary pairwise MRFs with cycles by adapting tree-reweighted max-product (RMP) belief propagation (Wainwright et al., 2005). We define an alternative message distortion metric which leads to a submodular particle selection IP. An efficient greedy algorithm is guaranteed to produce message errors within a factor (1 − 1/e) of the best achievable, and thus provides provably accurate max-marginal estimates. Our MATLAB library implements the D-PMP algorithm for general pairwise MRFs. For the tasks of optical flow estimation and protein side chain prediction, we demonstrate substantial improvements over previous PMP algorithms, and performance levels that match or exceed state-of-the-art domain-specific inference algorithms.

Figure 1. Loopy PMP flowcharts. The high-level structure of three variants of loopy particle max-product: the Greedy PMP (G-PMP) of Peng et al. (2011), the Top-N PMP (T-PMP) of Besse et al. (2012), and the Diverse PMP (D-PMP) of Pacheco et al. (2014).

## 2. Background

We begin with a brief introduction to max-product inference for discrete MRFs, which forms the basis for our particle-based approximations. To ground these concepts we introduce the protein side chain prediction task, which will be used to motivate and validate our approach.

### 2.1. Max-Product Belief Propagation

Max-product belief propagation (Pearl, 1988; Aji & McEliece, 2000; Wainwright & Jordan, 2008) performs MAP inference by passing messages along the edges of a graphical model. Consider a pairwise MRF with edges (s, t) ∈ E and nodes s ∈ V:

$$p(x) \propto \prod_{s \in V} \psi_s(x_s) \prod_{(s,t) \in E} \psi_{st}(x_s, x_t). \tag{1}$$

In MRFs with cycles, tree-reweighted max-product (RMP) (Wainwright et al., 2005) approximates MAP inference via a set of spanning trees, with edge appearance probabilities ρ_st. The RMP message from node t to s is

$$\tilde m_{ts}(x_s) = \max_{x_t}\; \psi_t(x_t)\, \psi_{st}(x_s, x_t)^{1/\rho_{st}}\, \frac{\prod_{u \in \Gamma(t) \setminus s} \tilde m_{ut}(x_t)^{\rho_{ut}}}{\tilde m_{st}(x_t)^{1-\rho_{st}}}, \tag{2}$$

where Γ(t) is the set of nodes neighboring t. Fixed points yield pseudo-max-marginals, which do not necessarily correspond to valid max-marginal distributions:

$$\tilde \nu_s(x_s) \propto \psi_s(x_s) \prod_{u \in \Gamma(s)} \tilde m_{us}(x_s)^{\rho_{us}} \;\not\equiv\; \max_{\{x' \,\mid\, x'_s = x_s\}} p(x').$$

Via connections to linear programming relaxations, RMP provides a bound on the MAP probability at each iteration, and a certificate of optimality using Lagrange multipliers.

### 2.2. Particle Max-Product Belief Propagation

For continuous variables x ∈ X, the message functions of Eq. (2) cannot be computed in general. Particle max-product (PMP) methods approximate messages by optimizing over a discrete set of particles found via stochastic search. Given a current set of N particles X_t at each node t, each PMP iteration has three stages, summarized in Figure 1.

**Stochastic Proposals.** To allow higher-likelihood state configurations to be discovered, at each iteration PMP first creates an augmented particle set X^aug = X ∪ X^prop of size αN, α > 1. New particles are drawn from proposal distributions, X^prop ∼ q(X). In the simplest case, Gaussian random walk proposals q_gauss(x'_s) = N(x'_s | x_s, Σ) sample perturbations of current particle locations x_s (Trinh & McAllester, 2009; Peng et al., 2011).
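As a concrete illustration of this proposal stage, the following minimal Python sketch builds an augmented particle set X^aug of size αN from Gaussian random-walk proposals. It is not the authors' MATLAB implementation; the node dimension, particle counts, isotropic covariance Σ = σ²I, and the `augment_particles` helper are illustrative assumptions.

```python
import numpy as np

def augment_particles(particles, alpha=2, sigma=0.5, rng=None):
    """Gaussian random-walk proposals: grow N particles to alpha*N.

    particles: (N, d) array of current particle locations for one node.
    Returns an (alpha*N, d) augmented particle set X^aug = X union X^prop.
    """
    rng = np.random.default_rng() if rng is None else rng
    N, d = particles.shape
    n_prop = (alpha - 1) * N
    # Perturb randomly chosen current particles with isotropic Gaussian noise.
    anchors = particles[rng.integers(0, N, size=n_prop)]
    proposals = anchors + sigma * rng.standard_normal((n_prop, d))
    return np.vstack([particles, proposals])

# Example: 3 current particles in 2D grown to an augmented set of 6.
X = np.array([[0.0, 0.0], [1.0, 2.0], [-1.0, 0.5]])
X_aug = augment_particles(X, alpha=2, sigma=0.25)
print(X_aug.shape)  # (6, 2)
```

In practice the proposal mixture also includes model-specific components, as described next.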
For some models, a more informative neighbor-based proposal is possible that samples from edge potentials, q_nbr(x_s | x_t) ∝ ψ_st(x_s, x_t), conditioned on a particle x_t at a neighboring node t ∈ Γ(s) (Besse et al., 2012). Specialized bottom-up proposals based on approximations of observation potentials ψ_s(x_s) can also be effective (Pacheco et al., 2014).

**Max-Product Optimization.** Standard or reweighted MP message updates are used to approximate the max-marginal distribution of each proposed particle. The αN values of each discrete message vector satisfy

$$m_{ts}(x_s) = \max_{x_t \in X_t^{\mathrm{aug}}} \psi_t(x_t)\, \psi_{st}(x_s, x_t)^{1/\rho_{st}}\, \frac{\prod_{u \in \Gamma(t) \setminus s} m_{ut}(x_t)^{\rho_{ut}}}{m_{st}(x_t)^{1-\rho_{st}}}.$$

Message updates require O(α²N²) operations, and compute the pseudo-max-marginal ν_s(x_s) for each x_s ∈ X_s^aug.

**Particle Selection.** Particles are accepted or rejected to yield N new states X^new ⊂ X^aug. Particle selection makes subsequent iterations more computationally efficient. The simple greedy PMP (G-PMP) method selects the single particle x*_s = arg max_{x_s ∈ X_s^aug} ν_s(x_s) with the highest max-marginal, and samples all other particles as Gaussian perturbations of this state (Trinh & McAllester, 2009; Peng et al., 2011). G-PMP updates are efficient, but they cannot preserve multiple modes, and random walk proposals do not effectively explore high-dimensional spaces.

A less greedy selection method retains the N particles with highest estimated max-marginal probability. This top-N PMP (T-PMP) (Pacheco et al., 2014) generalizes PatchMatch BP (Besse et al., 2012), a method specialized to low-level vision tasks which utilizes top-N particle selection and neighbor proposals. T-PMP finds high-probability solutions quickly, but the top-N particles are often slight perturbations of the same solution, reducing the number of effective particles and causing sensitivity to initialization.

Figure 2. Protein side chain. Left: A simple protein with two amino acids forms a backbone (thick) and side chains (thin). Right: A regular discretization of the first dihedral angle.

To avoid the particle degeneracy common to G-PMP and T-PMP, the diverse PMP (D-PMP) method (Pacheco et al., 2014) selects particles via combinatorial optimization. An IP favors particles that minimally distort the current MP messages, and thus implicitly encourages diversity. By preserving solutions near multiple local optima, D-PMP reasons more globally and is less sensitive to initialization.

### 2.3. Protein Side Chain Prediction

Predicting protein structure, which is governed by pairwise energetic interactions, is a natural target for PMP algorithms (Peng et al., 2011; Soltan Ghoraie et al., 2013). Proteins are formed by chains of amino acids, which consist of a backbone and a side chain unique to each amino acid type. Each protein assumes a 3D structure, or conformation, relating to its function. Given an amino acid sequence and a fixed backbone geometry, side chain prediction methods estimate the atomic coordinates of all side chain atoms.

We model side chain prediction as MAP inference in a pairwise MRF (Yanover & Weiss, 2002). The latent space is specified in terms of dihedral angles x ∈ R^d, which describe the relative orientation between two planes (Fig. 2). The number of dihedral angles d varies by amino acid type. Energy is more easily modeled in terms of the inter-atomic distance r_ij(x) between pairs of atoms i, j.
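Before turning to the potentials of this model, here is a minimal sketch of the particle-based reweighted message update of Sec. 2.2, written in the log domain for numerical stability. The function name, array shapes, and the assumption that log-potentials and incoming messages are precomputed on the particle grid are illustrative, and this is not the released MATLAB library; setting ρ = 1 recovers standard max-product.

```python
import numpy as np

def rmp_message_update(log_psi_t, log_psi_st, log_m_in, log_m_st, rho_in, rho_st):
    """One reweighted max-product message m_{t->s} on a particle discretization.

    log_psi_t : (B,)   unary log-potential at node t's augmented particles.
    log_psi_st: (A, B) pairwise log-potential between particles of s (rows) and t (cols).
    log_m_in  : (K, B) incoming log-messages m_{u->t} for the other neighbors u of t.
    log_m_st  : (B,)   previous log-message m_{s->t} (the reverse direction).
    rho_in    : (K,)   edge appearance probabilities rho_{ut}.
    rho_st    : float  edge appearance probability rho_{st}.
    Returns the (A,) log-message log m_{t->s}(a), maximized over t's particles.
    """
    # log of: psi_t(b) * psi_st(a,b)^(1/rho_st) * prod_u m_ut(b)^rho_ut / m_st(b)^(1-rho_st)
    log_M = (log_psi_t
             + log_psi_st / rho_st
             + (rho_in[:, None] * log_m_in).sum(axis=0)
             - (1.0 - rho_st) * log_m_st)          # shape (A, B) by broadcasting
    return log_M.max(axis=1)

# Toy example with A = B = 4 particles and one other neighbor u of t.
rng = np.random.default_rng(0)
A = B = 4
log_m = rmp_message_update(
    log_psi_t=rng.normal(size=B),
    log_psi_st=rng.normal(size=(A, B)),
    log_m_in=rng.normal(size=(1, B)),
    log_m_st=np.zeros(B),
    rho_in=np.array([1.0]),
    rho_st=1.0,   # rho = 1 reduces to standard max-product
)
print(log_m.shape)  # (4,)
```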
Pairwise terms encode interactions between nearby side chains via the attractive and repulsive components of the 6-12 Lennard-Jones log-potential:

$$\log \psi_{st}(x_s, x_t) = \sum_{i=1}^{N_s} \sum_{j=1}^{N_t} 4\epsilon \left[ \left( \frac{\sigma}{r_{ij}(x_s, x_t)} \right)^{6} - \left( \frac{\sigma}{r_{ij}(x_s, x_t)} \right)^{12} \right]. \tag{3}$$

Here N_s is the number of atoms in the s-th amino acid, ϵ controls the strength of attraction, and σ is the cutoff distance at which atoms do not interact. Log-likelihoods log ψ_s(x_s) are given by a Gaussian mixture fit to the marginal statistics of observed dihedral angles. More details are in Sec. 4.2.

## 3. Loopy Diverse Particle Max-Product

The D-PMP message updates of Pacheco et al. (2014) can be directly applied to loopy MRFs, since each step decomposes into local computations on the graph. But a naive extension may have convergence problems like those observed for loopy MP in many discrete models. Using RMP message passing, combined with our method for resolving ties, we can verify that global optimality is achieved and ensure that the MAP estimate is nondecreasing. We also introduce a new IP objective in the particle selection step which is a monotonic submodular maximization. This IP allows us to use a standard greedy algorithm for particle selection, and attain a multiplicative optimality bound.

### 3.1. Submodular Particle Selection

For each node t ∈ V we select particles to minimize the distortion between two message vectors. Specifically, we choose a subset of particles which minimizes the L1 norm,

$$\min_{z}\; \sum_{s \in \Gamma(t)} \big\| m_{ts} - \hat m_{ts}(z) \big\|_1 \quad \text{subject to } \|z\|_1 \le N,\; z \in \{0,1\}^{\alpha N}. \tag{4}$$

The message vector m_ts is computed over the augmented particles X_t^aug = {x_t^(1), ..., x_t^(αN)}, with α > 1. The message vector m̂_ts(z) is computed over any subset of at most N particles X_t^new ⊆ X_t^aug indexed by the indicator vector z,

$$m_{ts}(a) = \max_{b \in \{1, \ldots, \alpha N\}} M_{st}(a, b), \tag{5}$$

$$\hat m_{ts}(a; z) = \max_{b \in \{1, \ldots, \alpha N\}} z(b)\, M_{st}(a, b). \tag{6}$$

Here we have accumulated the terms needed for RMP message updates in a message foundation matrix

$$M_{st}(a, b) = \psi_t\big(x_t^{(b)}\big)\, \psi_{st}\big(x_s^{(a)}, x_t^{(b)}\big)^{1/\rho_{st}}\, \frac{\prod_{u \in \Gamma(t) \setminus s} m_{ut}(b)^{\rho_{ut}}}{m_{st}(b)^{1-\rho_{st}}}. \tag{7}$$

**Pseudo-Max-Marginal Bounds.** Particles are chosen to minimize message distortions, but our primary goal is to maintain approximations of the pseudo-max-marginals,

$$\nu_s(x_s) \propto \psi_s(x_s) \prod_{u \in \Gamma(s)} m_{us}(x_s)^{\rho_{us}}, \tag{8}$$

and analogously for the pseudo-max-marginals ν̂ on the selected particles. If the potentials are bounded above and normalized so that 0 ≤ ψ ≤ 1, then the sum of message distortions bounds the pseudo-max-marginal error.

**Proposition 1.** Let 0 ≤ m̂ ≤ m ≤ 1 and edge appearance probabilities ρ_st ∈ [0, 1]. For all nodes s ∈ V we have

$$\| \nu_s - \hat \nu_s \|_1 \;\le\; \sum_{t \in \Gamma(s)} \big\| (m_{ts} - \hat m_{ts})^{\rho_{ts}} \big\|_1, \tag{9}$$

where the power ρ_ts is applied elementwise. We provide a proof in the Appendix. Note that we do not bound the difference between the D-PMP max-marginals ν_s(x_s) and the continuous max-marginals ν̃_s(x_s); such results typically require strong and unrealistic assumptions (Peng et al., 2011). Instead, Eq. (9) shows that if we succeed in producing small message errors, the particle selection step will not significantly distort the pseudo-max-marginals, nor will it discard important hypotheses.
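To make the selection objective concrete, the sketch below evaluates the L1 message distortion of Eqs. (4)–(6) for a candidate indicator vector z, assuming the stacked message foundation matrices of Eq. (7) have already been computed in the linear domain. The shapes and the `selection_distortion` name are assumptions for illustration, not part of the released code.

```python
import numpy as np

def selection_distortion(M, z):
    """L1 message distortion of Eq. (4) for one node t, given stacked foundation matrices.

    M: (R, B) vertically stacked matrices [M_{s1 t}; ...; M_{sd t}] in the linear domain
       (entries in [0, 1]); rows index neighbor particles a, columns index the alpha*N
       augmented particles b at node t.
    z: (B,) boolean indicator of the retained particles X^new_t.
    """
    m_full = M.max(axis=1)          # Eq. (5): message over all augmented particles
    m_hat = (M * z).max(axis=1)     # Eq. (6): message over the selected subset only
    return np.abs(m_full - m_hat).sum()

# Toy example: 6 augmented particles at node t, two neighbors with 4 particles each.
rng = np.random.default_rng(1)
M = rng.uniform(size=(8, 6))
z = np.zeros(6, dtype=bool)
z[[0, 3]] = True                    # keep particles 0 and 3
print(selection_distortion(M, z))
```

Because m̂ ≤ m elementwise, this distortion is exactly the quantity that the greedy selection of Sec. 3.2 drives toward zero.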
Figure 3. LAZYGREEDY particle selection for side chain prediction, for the simple two-amino-acid protein of Fig. 2. We fix all but the first dihedral angle and select 3 particles out of a regular grid of 30 angles. Left: Objective function for the optimal IP, an LP relaxation, and LAZYGREEDY. Center: Message foundation matrix showing locations of all particles (black) and the three selected particles of the source node (green). Right: Augmented message (lower, blue), message approximations for the first three particle selections (lower, green), and corresponding margins (upper); the selected particle at each step is the maximizer of the margin (red).

**Submodularity.** The particle selection IP we propose minimizes the absolute sum of message differences (the L1 norm). In contrast, the selection objective proposed by Pacheco et al. (2014) minimizes the maximum message difference (the L∞ norm). With this modified error metric, minimizing Eq. (4) is equivalent to maximizing a monotonic submodular function.

**Definition 1** (Submodularity). A set function f : 2^Z → R defined over subsets of Z is submodular iff for any subsets Y ⊆ X ⊆ Z and any element e ∉ X the function f satisfies

$$f(Y \cup \{e\}) - f(Y) \;\ge\; f(X \cup \{e\}) - f(X).$$

The quantity Δ(Y, e) ≜ f(Y ∪ {e}) − f(Y) is the margin, and Def. 1 states that for any e ∉ X the margin is nonincreasing in |Y|. This property of diminishing marginal returns allows us to use efficient greedy algorithms.

**Proposition 2.** The optimization of Equation (4) is equivalent to maximizing a monotonic submodular objective subject to cardinality constraints.

*Proof.* We focus on a single node and drop subscripts. Dropping constants, we can minimize Eq. (4) by solving

$$\arg\max_{z :\, \|z\|_1 \le N}\; \sum_a F_a(z), \qquad F_a(z) = \max_{1 \le b \le \alpha N} z(b)\, M(a, b). \tag{10}$$

Let y, z ∈ {0, 1}^{αN} be particle selections with y ⊆ z, in the sense that y(b) = 1 implies z(b) = 1. For some candidate particle b̄, define the augmented selections

$$\bar y(b) = \begin{cases} 1, & \text{if } b = \bar b \\ y(b), & \text{o.w.} \end{cases} \qquad \bar z(b) = \begin{cases} 1, & \text{if } b = \bar b \\ z(b), & \text{o.w.} \end{cases}$$

The margins are given by direct calculation:

$$F_a(\bar y) - F_a(y) = \max\big(0,\; M(a, \bar b) - \hat m(a; y)\big), \qquad F_a(\bar z) - F_a(z) = \max\big(0,\; M(a, \bar b) - \hat m(a; z)\big).$$

Since y ⊆ z we have m̂(a; y) ≤ m̂(a; z), so F_a is submodular:

$$F_a(\bar y) - F_a(y) \;\ge\; F_a(\bar z) - F_a(z).$$

A sum of submodular functions is submodular, and monotonicity holds since m̂(y) ≤ m̂(z). ∎

### 3.2. LAZYGREEDY Particle Selection

The LAZYGREEDY algorithm exploits diminishing marginal returns to avoid redundant computations (Minoux, 1978; Leskovec et al., 2007). Each iteration updates and re-sorts the largest margins until a stable maximizer is found. The algorithm terminates when the particle budget is exhausted, or the maximum margin is zero. Surprisingly, this greedy approach yields solutions within a factor (1 − 1/e) ≈ 0.63 of optimal (Nemhauser et al., 1978).

**Initialize:** For each node t let M = [M^T_{s_1 t}, ..., M^T_{s_d t}]^T be the stacked message foundations of the neighbors Γ(t) = {s_1, ..., s_d}. Initialize the selection vector z and margins:

$$\Delta(b) = \sum_{a} M(a, b), \qquad z(b) = 0, \qquad \forall\, b \in \{1, \ldots, \alpha N\}. \tag{11}$$

**First Iteration:** Ensure that the current MAP estimate x* is never discarded by setting z(b*) = 1, where b* is the index of x*_t in the augmented particle set X_t^aug (see Sec. 3.3). Update the message approximation m̂(a) = M(a, b*).

**Iterations 2 to N:** Choose the largest margin to update, b̃ = arg max_{b : z(b) = 0} Δ(b). If Δ(b̃) = 0 then terminate early; the message can be perfectly reconstructed with a subset of the particles. If Δ(b̃) has already been updated on the current iteration then set z(b̃) = 1 and update the message approximation, m̂(a) = max(m̂(a), M(a, b̃)). Otherwise, update the margin and repeat:

$$\Delta(\tilde b) = \sum_a \Big[ \max\big(\hat m(a),\; M(a, \tilde b)\big) - \hat m(a) \Big].$$

Selections are performed in parallel, and updates at one node do not affect the selection at neighboring nodes. Figure 3 graphically demonstrates LAZYGREEDY selection on the small toy protein of Fig. 2.
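The following is a compact sketch of the LAZYGREEDY selection just described, using a lazily refreshed priority queue of margins. It is a simplified illustration under stated assumptions rather than the released library: the stacked foundation matrix M, the MAP-particle index b*, and the budget N are assumed inputs, and parallel per-node execution is not shown.

```python
import heapq
import numpy as np

def lazy_greedy_select(M, N, b_map=0):
    """LAZYGREEDY particle selection at one node (sketch).

    M     : (R, B) stacked message foundation matrices of all neighbors of node t.
    N     : particle budget (number of particles to retain).
    b_map : index of the current MAP particle, always retained first.
    Returns a boolean selection vector z of length B.
    """
    R, B = M.shape
    z = np.zeros(B, dtype=bool)
    # First iteration: keep the MAP particle and initialize the approximation.
    z[b_map] = True
    m_hat = M[:, b_map].copy()
    # Lazy priority queue of (negated margin, particle); optimistic margins from Eq. (11).
    heap = [(-M[:, b].sum(), b) for b in range(B) if b != b_map]
    heapq.heapify(heap)
    stale = {b: True for b in range(B) if b != b_map}   # margins not yet refreshed
    while z.sum() < N and heap:
        neg_margin, b = heapq.heappop(heap)
        if stale[b]:
            # Recompute this particle's margin against the current approximation.
            margin = (np.maximum(m_hat, M[:, b]) - m_hat).sum()
            stale[b] = False
            heapq.heappush(heap, (-margin, b))
            continue
        if -neg_margin <= 0.0:       # message already perfectly reconstructed
            break
        z[b] = True                  # stable maximizer found: select it
        m_hat = np.maximum(m_hat, M[:, b])
        for key in stale:            # all remaining margins become stale again
            stale[key] = True
    return z

# Toy example: keep 3 of 6 augmented particles.
rng = np.random.default_rng(2)
M = rng.uniform(size=(8, 6))
print(lazy_greedy_select(M, N=3))
```

Because margins are refreshed only when a candidate reaches the top of the queue, most iterations touch a small fraction of the αN candidates, which is the source of LAZYGREEDY's speedup over naive greedy selection.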
Figure 4. Label conflicts. Above: Selected side chain particles of two amino acids (PDB: 1QOW) for G-PMP, T-PMP, and D-PMP. Diversity in the D-PMP particle set presents more opportunity for an inconsistent labeling. Below: Naively maximizing the node max-marginal over two tied states can produce a very unlikely joint configuration.

### 3.3. Resolving Ties

In PMP we resolve ties using an approach similar to one proposed for discrete MRFs (Weiss et al., 2007). For discrete models the RMP pseudo-max-marginals ν admit a provably MAP solution x* if a consistent labeling exists in the set of maxima (Wainwright et al., 2005):

$$x^*_s \in \arg\max_{x_s} \nu_s(x_s), \qquad (x^*_s, x^*_t) \in \arg\max_{x_s, x_t} \nu_{st}(x_s, x_t).$$

For continuous distributions exact ties rarely exist, but small numerical errors in the estimated pseudo-max-marginals can perturb the particle that is inferred to be most likely, and lead to joint states with low probability due to conflicted edges. This problem is common in the side chain model, and as illustrated in Fig. 4, the diversity of the D-PMP particles makes conflicts more likely. To address this we relax the set of optima to be states with pseudo-max-marginal values within tolerance ϵ of the maximum:

$$\mathrm{OPT}(\nu_s) \triangleq \Big\{ x^*_s : \big| \nu_s(x^*_s) - \max_{x_s} \nu_s(x_s) \big| \le \epsilon \Big\}.$$

Let V_T be the set of tied nodes with more than one near-maximal state, and E_T ⊆ E ∩ (V_T × V_T) the edges joining them. Let x*_NT be the unique assignments for non-tied nodes. Construct an MRF over the remaining tied nodes as

$$p_T(x_T) \propto \prod_{s \in V_T} \tilde\psi_s(x_s) \prod_{(s,t) \in E_T} \psi_{st}(x_s, x_t), \tag{12}$$

with the conditioned node potentials

$$\tilde\psi_s(x_s) = \psi_s(x_s) \prod_{t \in \Gamma(s) \setminus V_T} \psi_{st}(x_s, x^*_t). \tag{13}$$

We label the remaining nodes x*_T = arg max_{x_T} p_T(x_T) using the junction tree algorithm. If the junction tree contains a unique maximizer, then x* = (x*_T, x*_NT) is the global MAP over the particles X. This guarantee follows from the reparameterization property of pseudo-max-marginals and Theorem 2 of Weiss et al. (2007). Clique size is reduced by eliminating non-tied nodes, and by constraining labels to the sets of tied states OPT(ν_s).

Figure 5. Primal and dual trajectories for a single protein (PDB: 1QOW) over all RMP iterations and 10 D-PMP steps; peaks indicate resampling. Left: Without resolving ties a MAP labeling is not obtained. Right: With tie resolution the duality gap vanishes.

## 4. Experimental Results

We consider two tasks that demonstrate the effectiveness and flexibility of D-PMP inference. We begin with optical flow estimation, a low-level vision task which recovers 2D pixel motion in an image sequence. Optical flow is a well-studied problem where specialized inference methods are thought to be near-optimal for the model and dataset we consider, and so provide a good comparison. We then revisit our running example of protein side chain prediction, which is more challenging due to increased dimensionality and complex potentials. Many methods for side chain prediction make coarse discrete approximations to speed up computation, and we show significant improvement using D-PMP to optimize the continuous energy model.

### 4.1. Optical Flow

Given a pair of (grayscale) images I1 and I2 in R^{M×N}, we estimate the motion of each pixel s from one image to the next. This flow vector x_s is decomposed into horizontal u_s and vertical v_s scalar components. The model presented below is based on the Classic-C method (Sun et al., 2014). To reduce the number of edges we model flow at the superpixel level, holding flow constant over each superpixel. Edges are given by the immediate neighbors in I1.
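As one way to realize the superpixel graph just described, the sketch below extracts undirected neighbor edges from a precomputed superpixel label image. The label-map representation and the `superpixel_edges` helper are assumptions for illustration; any segmentation (e.g., SLIC) could supply the labels.

```python
import numpy as np

def superpixel_edges(labels):
    """Undirected edges between adjacent superpixels.

    labels: (H, W) integer array assigning each pixel to a superpixel (graph node).
    Returns a (num_edges, 2) array of unique node pairs (s, t) with s < t.
    """
    # Pairs of horizontally and vertically adjacent pixel labels.
    horiz = np.stack([labels[:, :-1].ravel(), labels[:, 1:].ravel()], axis=1)
    vert = np.stack([labels[:-1, :].ravel(), labels[1:, :].ravel()], axis=1)
    pairs = np.vstack([horiz, vert])
    pairs = pairs[pairs[:, 0] != pairs[:, 1]]   # drop within-superpixel pairs
    pairs.sort(axis=1)                          # make edges undirected
    return np.unique(pairs, axis=0)

# Toy 4x4 label image with 4 superpixels arranged in a 2x2 grid.
labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [2, 2, 3, 3],
                   [2, 2, 3, 3]])
print(superpixel_edges(labels))  # edges: (0,1), (0,2), (1,3), (2,3)
```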
The pairwise log-potential enforces a smoothness prior on flow vectors. We use the robust Charbonnier penalty, a differentiable approximation to the L1 norm which is approximately quadratic in the range [−σ, σ] and smoothly transitions to a linear function outside this range. The potential decomposes additively as log ψ_st = φ^vert_st + φ^hor_st into vertical and horizontal components, defined as follows:

$$\phi^{\mathrm{hor}}_{st}(u_s, u_t) = -\lambda_s \sqrt{\sigma^2 + (u_s - u_t)^2}. \tag{14}$$

The spatial smoothness depends on the scaling parameter λ_s.

Figure 6. Preserving multiple hypotheses. Top row: Final flow estimate of each method (G-PMP, T-PMP, D-PMP) and the ground truth for the Rubber Whale sequence. The color key (top-right) encodes flow vector orientation; color saturation denotes magnitude. Bottom row: Detail of the highlighted region showing selected flow particles as vectors (black) and the MAP label (red). The MAP estimates of D-PMP and T-PMP have higher probability than ground truth, but D-PMP preserves the correct flow in the particle set.

Figure 7. Optical flow results. Left: Log-probability quantiles showing median (solid) and best/worst (dashed) MAP estimates versus PMP iteration for 11 random initializations on the Middlebury training set. Left-center: Oracle AEPE over the training set. Right-center: Log-probability quantiles on the test set (G-PMP omitted due to poor performance on training). Right: Log-probability of the MAP estimates in the pixel-level model, obtained by initializing L-BFGS at the D-PMP solution.

Likelihood potentials log ψ_s(x_s) = φ_s(x_s) assume brightness constancy: properly matched pixels should have similar intensities. Each superpixel s contains a number of pixels I_s = {(i_1, j_1), ..., (i_k, j_k)}, and for each pixel (i, j) we compute the warped coordinates (ĩ, j̃) = (i + u_s, j + v_s). The likelihood penalizes the difference in image intensities, again using the Charbonnier penalty:

$$\phi_s(u_s, v_s) = -\lambda_d \sum_{(i,j) \in I_s} \sqrt{\sigma^2 + \big(I_1(i, j) - I_2(\tilde i, \tilde j)\big)^2}. \tag{15}$$

In computing the warped coordinates we also constrain any pixels which flow outside the image boundary to lie exactly on the boundary, ĩ = min(M, max(0, i + u_s)). We apply bicubic interpolation for non-integer coordinates.

**Results.** We evaluate on the Middlebury optical flow benchmark (Baker et al., 2011) using 11 random initializations. D-PMP and T-PMP utilize the same set of proposals (75% neighbor, 25% random walk). We compute SLIC superpixels (Achanta et al., 2012) with region size 5 and regularizer 0.1, giving about 5,000 to 15,000 superpixels per image. We use the Charbonnier widths σ = 0.001 recommended for this model (Sun et al., 2014), but learn different scaling parameters (λ_s = 16, λ_d = 1) to compensate for our superpixel representation. The Middlebury training set contains 8 images with ground truth flow, and we report log-probability quantiles over this set (Fig. 7 (left)).

Table 1. Optical flow MAP estimates. Average log-probability and AEPE over 11 random initializations on the Middlebury training set. Reported p-values are for comparison against D-PMP using a Wilcoxon signed rank test; we consider p < 0.05 significant.

| Method | Avg. Log-Prob. (p-value) | Avg. EPE (p-value) |
| --- | --- | --- |
| RMP | -2.446E6 (0.008) | 1.623 (0.008) |
| G-PMP | -1.408E6 (0.008) | 0.699 (0.008) |
| T-PMP | -1.212E6 (0.008) | 0.382 (0.727) |
| D-PMP | -1.209E6 (–) | 0.362 (–) |
| Classic-C | – | 0.349 (0.727) |
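A minimal sketch of the Charbonnier potentials in Eqs. (14)–(15) follows, assuming grayscale images stored as NumPy arrays and a per-superpixel pixel list. For brevity it rounds warped coordinates to the nearest pixel rather than using the bicubic interpolation described above, and the default parameter values simply echo those reported for this model; it is an illustration, not the evaluation code used in the experiments.

```python
import numpy as np

def charbonnier(x, sigma=1e-3):
    """Robust Charbonnier penalty, a smooth approximation to |x|."""
    return np.sqrt(sigma ** 2 + x ** 2)

def log_psi_smooth(flow_s, flow_t, lam_s=16.0, sigma=1e-3):
    """Pairwise log-potential of Eq. (14): penalize flow differences between neighbors."""
    du, dv = flow_s[0] - flow_t[0], flow_s[1] - flow_t[1]
    return -lam_s * (charbonnier(du, sigma) + charbonnier(dv, sigma))

def log_psi_data(I1, I2, pixels, flow, lam_d=1.0, sigma=1e-3):
    """Unary log-potential of Eq. (15): brightness constancy over one superpixel.

    pixels: (K, 2) integer (i, j) coordinates belonging to the superpixel.
    flow  : (u, v) candidate flow for this superpixel.
    Warped coordinates are rounded and clamped to the image boundary.
    """
    M, N = I2.shape
    u, v = flow
    i_w = np.clip(np.rint(pixels[:, 0] + u).astype(int), 0, M - 1)
    j_w = np.clip(np.rint(pixels[:, 1] + v).astype(int), 0, N - 1)
    diff = I1[pixels[:, 0], pixels[:, 1]] - I2[i_w, j_w]
    return -lam_d * charbonnier(diff, sigma).sum()

# Toy example on random 8x8 images and a 2x2 superpixel.
rng = np.random.default_rng(3)
I1, I2 = rng.uniform(size=(8, 8)), rng.uniform(size=(8, 8))
pix = np.array([[2, 2], [2, 3], [3, 2], [3, 3]])
print(log_psi_data(I1, I2, pix, flow=(1.0, 0.5)))
print(log_psi_smooth((1.0, 0.5), (0.8, 0.6)))
```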
To demonstrate diversity in the particle sets we report the average endpoint error (AEPE) of the oracle solution: we choose the flow particle closest to ground truth, in the order given by the particle selection step (Fig. 7 (left-center)). D-PMP shows a large reduction in AEPE after just a few particles. T-PMP remains nearly flat, suggesting little diversity. In just two dimensions the Gaussian spread of G-PMP particles naturally leads to an error reduction, although the error remains higher than D-PMP. The benefit of particle diversity is best visualized near object boundaries (see Fig. 6).

Figure 8. Side chain prediction. We compare each method, including both the L1 and L∞ diverse selection objectives. Left: Total log-probability over 20 proteins; median (solid) and best/worst (dashed) results over 11 random initializations. Left-center: Total log-probability for 370 proteins. Right-center: RMSD (in angstroms, Å) of the oracle solution on the larger set. Right: Log-probability of all 370 proteins versus the fixed rotamer discretization with RMP inference.

We compare to a specialized coarse-to-fine, multiscale inference algorithm for Classic-C¹, using default settings and with the median filter disabled. We also compare to RMP on a fixed regular discretization of 200 flow vectors. As shown in Table 1, D-PMP yields significantly higher-probability solutions, but is equivalent to T-PMP in AEPE. D-PMP also achieves results equivalent to Classic-C optimization, which is highly tuned to the Middlebury dataset. We cannot directly compare the probability of the Classic-C and D-PMP solutions, because the former models flow at the pixel level. Instead, using L-BFGS initialized from the D-PMP solution, we optimize the pixel-level model and compare the log-probability of the result with Classic-C for both training and test sequences (Fig. 7 (right)). Again, even compared to a highly tuned specialized optimization method, D-PMP achieves statistically equivalent results.

¹ http://people.seas.harvard.edu/~dqsun. Experiments use code accessed on 06 February 2015.

### 4.2. Protein Side Chain Prediction

Most computational approaches optimize side chain placement over a standard discretization, known as a rotamer library (Bower et al., 1997; Fromer et al., 2010). Rotamer configurations are learned from the marginal statistics of experimentally validated side chains and generally allocate three states {60°, 180°, 300°} for each dihedral angle, resulting in up to 81 possible states per node. This is a coarse discretization which can fail to capture important details of the side chain placement. Applying D-PMP we optimize the continuous energy function, allowing estimation of non-rotameric side chains which do not obey the standard discretization (see Fig. 9). Log-likelihoods are the so-called Dunbrack probabilities (Shapovalov & Dunbrack, 2007): Gaussian mixtures with means centered on rotamer configurations.

Figure 9. Non-rotameric side chains (PDB: 1GK9, Trp154). Left: X-ray (green), RMP (red), Rosetta (magenta), and D-PMP (black) estimates. Center: Standard rotamers are all poor approximations. Right: Final D-PMP particles, all overlapping the level set of the electron density (mesh).
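The exact Dunbrack parameterization is not reproduced here, but the sketch below shows how a rotamer-centered Gaussian mixture over dihedral angles can be evaluated with circular (wrapped) angular differences. The means, weights, and shared standard deviation are placeholder values for illustration, not the fitted Dunbrack statistics used in the experiments.

```python
import numpy as np

def wrap_angle(delta):
    """Wrap angular differences (degrees) into [-180, 180)."""
    return (delta + 180.0) % 360.0 - 180.0

def log_rotamer_mixture(x, means, weights, sigma=15.0):
    """Log of a Gaussian mixture over dihedral angles with circular differences.

    x      : (d,)   candidate dihedral angles in degrees.
    means  : (K, d) mixture means (e.g., centered on rotamer configurations).
    weights: (K,)   mixture weights summing to one.
    """
    d = x.shape[0]
    diff = wrap_angle(x[None, :] - means)                        # (K, d)
    log_comp = (-0.5 * (diff / sigma) ** 2).sum(axis=1) \
               - d * np.log(sigma * np.sqrt(2.0 * np.pi))
    # Log-sum-exp over components for numerical stability.
    a = np.log(weights) + log_comp
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

# Toy example: two dihedral angles, rotamer-like means at 60/180/300 degrees.
means = np.array([[60.0, 60.0], [180.0, 60.0], [300.0, 180.0]])
weights = np.array([0.5, 0.3, 0.2])
print(log_rotamer_mixture(np.array([62.0, 55.0]), means, weights))
```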
**Results.** We evaluate the energy function with Rosetta (Rohl et al., 2004), a state-of-the-art molecular modeling package. We configure the Rosetta energy using three terms: the Lennard-Jones attractive and repulsive terms (`fa_atr`, `fa_rep`) and the Dunbrack probabilities (`fa_dun`), each with unit weight. We run PMP with 50 particles for 50 iterations. D-PMP and T-PMP proposals are 50% random walks from Gaussians wrapped to account for angular discontinuities, and 50% samples from the rotamer marginals. Neighbor-based proposals are not used, due to the complex transformation between dihedral angles and atom locations. We compare to Rosetta's implementation of simulated annealing, using Metropolis-Hastings proposals from the discrete rotamers, followed by local continuous optimization.

We experiment on two sets of proteins selected from the Protein Data Bank², resolved using X-ray diffraction below 1.5 Å resolution, and with fewer than 1000 amino acids. We run each method from 11 random initializations on a small set (20 proteins) and report quantiles of total log-probability (Fig. 8 (left)). Both D-PMP and T-PMP outperform G-PMP, due to their ability to exploit the model likelihood through rotamer proposals, with D-PMP showing the tightest confidence intervals. The second set is larger (370 proteins), and we report the total log-probability of a single run for each method (Fig. 8 (left-center)).

² http://www.pdb.org

Figure 10. Side chain particles. Top row: Final particles for T-PMP and D-PMP, and the ground truth conformation of a single protein (PDB ID: 2H8E). The region marked in red is detailed below. Bottom row: Closeup of the first ten amino acids, showing the fixed backbone (black) and final particles colored by backbone location. D-PMP preserves more diverse particles in areas of uncertainty.

Diversity is important in structure prediction, since proteins are known to alternate between many stable configurations (Ma et al., 1999). Fig. 10 shows a qualitative comparison of diversity between D-PMP and T-PMP for a single protein. To measure diversity we report the RMSD of the oracle solution (Fig. 8); D-PMP shows a substantial improvement in accuracy after only a few particles. We also compare the submodular particle selection (L1) with the minimax formulation (L∞); both preserve diversity similarly, but the former offers stronger theoretical justification.

## 5. Discussion

We have generalized previous PMP algorithms in several substantial ways. Our proposed extensions to D-PMP not only allow inference in loopy MRFs, but our reformulation of the particle selection IP also allows for greedy optimization within a guaranteed optimality bound. We demonstrate effectiveness in protein structure prediction, where we are substantially more accurate than the G-PMP algorithm that Peng et al. (2011) applied to a broader structure prediction task, and than the state-of-the-art Rosetta package. The same general-purpose D-PMP algorithm is also competitive with standard inference algorithms for a very loopy optical flow model. A MATLAB library implementing these methods, built on UGM (Schmidt, 2007), is available³.

³ http://www.cs.brown.edu/~pachecoj

**Acknowledgements.** We thank Silvia Zuffi for her advice about connections between submodularity and the particle selection problem. This research was supported in part by ONR Award No. N00014-13-1-0644. J. Pacheco was supported in part by funding from the Naval Undersea Warfare Center, Division Newport, Rhode Island.

## Appendix
### Proof of Proposition 1

To simplify, we ignore normalization terms and drop the dependence on z, so that m̂(z) = m̂. The proof is by induction on the number of neighbors. For the base case, let Γ(s) = {i, j}:

$$
\begin{aligned}
\| \nu_s - \hat\nu_s \|_1
&= \sum_{x_s} \psi_s(x_s)\, \Big| m_{is}(x_s)^{\rho_{is}} m_{js}(x_s)^{\rho_{js}} - \hat m_{is}(x_s)^{\rho_{is}} \hat m_{js}(x_s)^{\rho_{js}} \Big| \\
&\le \sum_{x_s} \Big[ \big( m_{is}(x_s)^{\rho_{is}} - \hat m_{is}(x_s)^{\rho_{is}} \big)\, m_{js}(x_s)^{\rho_{js}} + \big( m_{js}(x_s)^{\rho_{js}} - \hat m_{js}(x_s)^{\rho_{js}} \big)\, \hat m_{is}(x_s)^{\rho_{is}} \Big] \\
&\le \sum_{x_s} \Big[ \big( m_{is}(x_s)^{\rho_{is}} - \hat m_{is}(x_s)^{\rho_{is}} \big) + \big( m_{js}(x_s)^{\rho_{js}} - \hat m_{js}(x_s)^{\rho_{js}} \big) \Big] \\
&\le \big\| (m_{is} - \hat m_{is})^{\rho_{is}} \big\|_1 + \big\| (m_{js} - \hat m_{js})^{\rho_{js}} \big\|_1.
\end{aligned}
$$

The first inequality drops ψ_s ∈ [0, 1] and the absolute values, since m̂_ts ≤ m_ts. The second inequality holds since m, m̂ ∈ [0, 1]. The last follows from the triangle inequality, since |x − y|^ρ is a metric (though not a norm) for ρ ∈ (0, 1). For the inductive step, let Γ(s) = {t_1, ..., t_n} and assume

$$\big\| \nu_s^{\setminus n} - \hat\nu_s^{\setminus n} \big\|_1 \;\le\; \sum_{i \ne n} \big\| (m_{t_i s} - \hat m_{t_i s})^{\rho_{t_i s}} \big\|_1,$$

where ν_s^{\setminus n}(x_s) is the product of all messages except m_{t_n s}. Then

$$\| \nu_s - \hat\nu_s \|_1 \;\le\; \big\| (m_{t_n s} - \hat m_{t_n s})^{\rho_{t_n s}} \big\|_1 + \big\| \nu_s^{\setminus n} - \hat\nu_s^{\setminus n} \big\|_1 \;\le\; \sum_{i=1}^{n} \big\| (m_{t_i s} - \hat m_{t_i s})^{\rho_{t_i s}} \big\|_1. \qquad \blacksquare$$

## References

Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., and Süsstrunk, S. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE PAMI, 34(11):2274–2282, 2012.

Aji, S. M. and McEliece, R. J. The generalized distributive law. IEEE Transactions on Information Theory, 46(2):325–343, 2000.

Andrieu, C., De Freitas, N., Doucet, A., and Jordan, M. I. An introduction to MCMC for machine learning. Machine Learning, 50(1–2):5–43, 2003.

Baker, S., Scharstein, D., Lewis, J. P., Roth, S., Black, M. J., and Szeliski, R. A database and evaluation methodology for optical flow. IJCV, 92(1):1–31, 2011.

Besse, F., Rother, C., Fitzgibbon, A., and Kautz, J. PMBP: PatchMatch belief propagation for correspondence field estimation. In BMVC, 2012.

Bower, M. J., Cohen, F. E., and Dunbrack Jr., R. L. Prediction of protein side-chain rotamers from a backbone-dependent rotamer library: a new homology modeling tool. Journal of Molecular Biology, 267(5):1268–1282, 1997.

Fromer, M., Yanover, C., Harel, A., Shachar, O., Weiss, Y., and Linial, M. SPRINT: side-chain prediction inference toolbox for multistate protein design. Bioinformatics, 26(19):2466–2467, 2010.

Geman, S. and Geman, D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE PAMI, 6(6):721–741, November 1984.

Kothapa, R., Pacheco, J., and Sudderth, E. Max-product particle belief propagation. Master's project report, Brown University Dept. of Computer Science, 2011.

Leskovec, J., Krause, A., Guestrin, C., Faloutsos, C., VanBriesen, J., and Glance, N. Cost-effective outbreak detection in networks. In KDD, pp. 420–429. ACM, 2007.

Ma, B., Kumar, S., Tsai, C.-J., and Nussinov, R. Folding funnels and binding mechanisms. Protein Engineering, 12(9):713–720, 1999.

Minoux, M. Accelerated greedy algorithms for maximizing submodular set functions. In Optimization Techniques, pp. 234–243. Springer, 1978.

Nemhauser, G. L., Wolsey, L. A., and Fisher, M. L. An analysis of approximations for maximizing submodular set functions. Mathematical Programming, 14(1):265–294, 1978.

Pacheco, J., Zuffi, S., Black, M., and Sudderth, E. Preserving modes and messages via diverse particle selection. In ICML, pp. 1152–1160, 2014.

Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Francisco, CA, 1988.

Peng, J., Hazan, T., McAllester, D., and Urtasun, R. Convex max-product algorithms for continuous MRFs with applications to protein folding. In ICML, 2011.

Rohl, C. A., Strauss, C. E. M., Misura, K. M. S., and Baker, D. Protein structure prediction using Rosetta. Methods in Enzymology, 383:66–93, 2004.

Schmidt, M. UGM: A MATLAB toolbox for probabilistic undirected graphical models. http://www.cs.ubc.ca/~schmidtm/Software/UGM.html, 2007.
Shapovalov, M. V. and Dunbrack, R. L. Statistical and conformational analysis of the electron density of protein side chains. Proteins: Struct., Func., and Bioinf., 66(2):279–303, 2007.

Soltan Ghoraie, L., Burkowski, F., Li, S. C., and Zhu, M. Residue-specific side-chain polymorphisms via particle belief propagation. IEEE/ACM Trans. on Comp. Bio. and Bioinf., 2013.

Sun, D., Roth, S., and Black, M. J. A quantitative analysis of current practices in optical flow estimation and the principles behind them. IJCV, 106(2):115–137, 2014.

Trinh, H. and McAllester, D. Unsupervised learning of stereo vision with monocular cues. In BMVC, 2009.

Wainwright, M. J. and Jordan, M. I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1:1–305, 2008.

Wainwright, M. J., Jaakkola, T. S., and Willsky, A. S. MAP estimation via agreement on trees: message-passing and linear programming. IEEE Transactions on Information Theory, 51(11):3697–3717, 2005.

Weiss, Y., Yanover, C., and Meltzer, T. MAP estimation, linear programming and belief propagation with convex free energies. In UAI, 2007.

Yanover, C. and Weiss, Y. Approximate inference and protein-folding. In Advances in Neural Information Processing Systems, pp. 1457–1464, 2002.