Learning Fast-Mixing Models for Structured Prediction

Jacob Steinhardt (JSTEINHARDT@CS.STANFORD.EDU)
Percy Liang (PLIANG@CS.STANFORD.EDU)
Stanford University, 353 Serra Street, Stanford, CA 94305 USA

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

Abstract

Markov Chain Monte Carlo (MCMC) algorithms are often used for approximate inference inside learning, but their slow mixing can be difficult to diagnose and the resulting approximate gradients can seriously degrade learning. To alleviate these issues, we define a new model family using strong Doeblin Markov chains, whose mixing times can be precisely controlled by a parameter. We also develop an algorithm to learn such models, which involves maximizing the data likelihood under the induced stationary distribution of these chains. We show empirical improvements on two inference tasks.

1. Introduction

Conventional wisdom suggests that rich features and highly-dependent variables necessitate intractable inference. Indeed, the dominant paradigm is to first define a joint model, and then use approximate inference (e.g., MCMC) to learn that model. While this recipe can generate good results in practice, it has two notable drawbacks: (i) diagnosing convergence of Markov chains is extremely difficult (Gelman & Rubin, 1992; Cowles & Carlin, 1996); and (ii) approximate inference can be highly suboptimal in the context of learning (Wainwright, 2006; Kulesza & Pereira, 2007).

In this paper, we instead use MCMC to define the model family itself: for a given T, we construct a family of Markov chains using arbitrarily rich features, but whose mixing time is guaranteed to be at most O(T). The corresponding stationary distributions determine the model family, with larger T leading to more expressive model families. We can think of our Markov chains as parameterizing a family of computationally accessible distributions, where the amount of computation is controlled by T.

Concretely, suppose we are performing a structured prediction task from an input x to an output y. We construct Markov chains of the following form, called strong Doeblin chains (Doeblin, 1940):

$\tilde{A}_\theta(y_t \mid y_{t-1}, x) = (1-\epsilon)\, A_\theta(y_t \mid y_{t-1}, x) + \epsilon\, u_\theta(y_t \mid x),$   (1)

where ε is a mixture coefficient and θ parameterizes Aθ and uθ. For intuition, think of uθ as a simple tractable model and Aθ as Gibbs sampling in a complex intractable model. With probability 1−ε, we progress according to Aθ, and with probability ε, we draw a fresh sample from uθ, which corresponds to an informed random restart. Importantly, uθ does not depend on the previous state $y_{t-1}$.

When ε = 1, we are drawing i.i.d. samples from uθ; we therefore mix in a single step, but our stationary distribution must necessarily be very simple. When ε = 0, the stationary distribution can be much richer, but we have no guarantees on the mixing time. For intermediate values of ε, we trade off between representational power and mixing time. A classic result is that a given strong Doeblin chain mixes in time at most 1/ε (Doeblin, 1940), and that we can draw an exact sample from the stationary distribution in expected time O(1/ε) (Corcoran & Tweedie, 1998).

In this work, we prove new results that help us understand the strong Doeblin model families. Let F and F̃ be the families of stationary distributions corresponding to Aθ and Ãθ as defined in (1), respectively.
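To make the construction in (1) concrete, the following is a minimal sketch of a single transition of the chain. The callables `sample_u` and `sample_A` are assumed black-box samplers for uθ(· | x) and Aθ(· | y_{t−1}, x); they are illustrative names, not part of the paper's code.

```python
import random

def doeblin_step(y_prev, x, eps, sample_u, sample_A):
    """One transition of the strong Doeblin chain in Eq. (1).

    With probability eps, restart from the tractable model u_theta (ignoring
    y_prev); otherwise take one step of the complex kernel A_theta.
    """
    if random.random() < eps:
        return sample_u(x)           # informed random restart
    return sample_A(y_prev, x)       # e.g., one sweep of Gibbs sampling
```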
Our first result is that as ε decreases, the stationary distribution of any Ãθ monotonically approaches the stationary distribution of Aθ (as measured by either direction of the KL divergence). Our second result is that if 1/ε is much larger than the mixing time of Aθ, then the stationary distributions of Aθ and Ãθ are close under a certain Mahalanobis distance. This shows that any member of F that is computationally accessible via the Markov chain is well-approximated by its counterpart in F̃.

The figure above shows F and F̃, together with the subset F0 of F whose Markov chains mix quickly. F̃ (approximately) covers F0, and also contains some distributions outside of F.

In order to learn over F̃, we show how to maximize the likelihood of the data under the stationary distribution of Ãθ. Specifically, we show that we can compute a stochastic gradient of the log-likelihood in expected time O(1/ε). Thus, in a strong sense, our objective function explicitly accounts for computational constraints.

We also generalize strong Doeblin chains, which are a mixture of two base chains, uθ and Aθ, to staged strong Doeblin chains, which allow us to combine more than two base chains. We introduce an auxiliary variable z representing the stage that the chain is in. We then transition between stages, using the base chain corresponding to the current stage z to advance the concrete state y. A common application of this generalization is defining a sequence of increasingly complex chains, similar in spirit to annealing. This allows sampling to become gradually more sensitive to the structure of the problem.

We evaluated our methods on two tasks: (i) inferring words from finger gestures on a touch screen and (ii) inferring DNF formulas for program verification. Unlike many structured prediction problems where local potentials provide a large fraction of the signal, in the two tasks above, local potentials offer a very weak signal; reasoning carefully about the higher-order potentials is necessary to perform well. On word inference, we show that learning strong Doeblin chains obtains a 3.6% absolute improvement in character accuracy over Gibbs sampling while requiring 5x fewer samples. On DNF formula inference, our staged strong Doeblin chain obtains an order-of-magnitude speed improvement over plain Metropolis-Hastings.

To summarize, the contributions of this paper are:

- We formally define a family of MCMC algorithms based on strong Doeblin chains with guaranteed fast mixing times (Section 2).
- We provide an extensive analysis of the theoretical properties of this family (Section 3), together with a generalization to a staged version (Section 3.1).
- We provide an algorithm for learning the parameters of a strong Doeblin chain (Section 4).
- We demonstrate superior experimental results relative to baseline MCMC samplers on two tasks, word inference and DNF formula synthesis (Section 5).

2. A Fast-Mixing Family of Markov Chains

Given a Markov chain with transition matrix $A(y_t \mid y_{t-1})$ and a distribution $u(y_t)$, define a new Markov chain with transitions given by $\tilde{A}(y_t \mid y_{t-1}) \overset{\mathrm{def}}{=} (1-\epsilon)\, A(y_t \mid y_{t-1}) + \epsilon\, u(y_t)$. (We suppress the dependence on θ and x for now.)

Figure 1. Left: A simple 3-state Markov chain. Arcs denote transition probabilities. Right: plot of the stationary probability of state 1 as a function of the restart probability ε (ranging from $10^{-6}$ to $10^{0}$), for δ = $10^{-4}$; closer to 0.499 (the true probability) is better.
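The sketch below reproduces the qualitative behavior in the right panel of Figure 1. The specific arc probabilities are an assumption: they are chosen so that the chain's stationary distribution is (1, 3δ, 1)/(2+3δ) and restarts fall into state 3 with probability 5/9 (vs. 4/9 for state 1), as stated in the text; the figure's exact chain may differ.

```python
import numpy as np

delta = 1e-4
# Column-stochastic transition matrix: A[i, j] = P(next state i | current state j).
# Assumed arc probabilities, consistent with the stationary distribution
# (1, 3*delta, 1)/(2 + 3*delta) and the 4/9 vs. 5/9 restart behavior in the text.
A = np.array([[1 - delta, 1/3, 0.0        ],
              [delta,     0.0, 2 * delta  ],
              [0.0,       2/3, 1 - 2*delta]])
u = np.full(3, 1/3)  # uniform restart distribution over {1, 2, 3}

def stationary(P):
    """Stationary distribution of a column-stochastic matrix (eigenvector for eigenvalue 1)."""
    vals, vecs = np.linalg.eig(P)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    return v / v.sum()

for eps in [1e-6, 1e-4, 1e-2, 1.0]:
    A_tilde = (1 - eps) * A + eps * np.outer(u, np.ones(3))   # the Section 2 construction
    print(f"eps = {eps:.0e}: stationary probability of state 1 = {stationary(A_tilde)[0]:.4f}")
# For eps << delta this approaches 1/(2 + 3*delta) ~ 0.4999; for eps >> delta the
# restarts noticeably perturb it (down toward roughly 4/9), matching the two regimes.
```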
Note the two regimes ε ≫ δ and ε ≪ δ.

In matrix notation, we can write Ã as

$\tilde{A} \overset{\mathrm{def}}{=} (1-\epsilon)\, A + \epsilon\, u \mathbf{1}^\top.$   (2)

In other words, with probability ε we restart from u; otherwise, we transition according to A. Intuitively, Ã should mix quickly because a restart from u renders the past independent of the future (we formalize this in Section 3). We think of u as a simple tractable model that provides coverage, and A as a complex model that provides precision.

Simple example. To gain some intuition, we work through a simple example with the Markov chain A depicted in Figure 1. The stationary distribution of this chain is $\left(\tfrac{1}{2+3\delta},\ \tfrac{3\delta}{2+3\delta},\ \tfrac{1}{2+3\delta}\right)$, splitting most of the probability mass evenly between states 1 and 3. The mixing time of this chain is approximately 1/δ, since once the chain falls into either state 1 or state 3, it will take about 1/δ steps for it to escape back out. If we run this Markov chain for T steps with T ≪ 1/δ, then our samples will be either almost all in state 1 or almost all in state 3, and thus will provide a poor summary of the distribution. If instead we perform random restarts with probability ε from a uniform distribution u over {1, 2, 3}, then the restarts give us the opportunity to explore both modes of the distribution. After a restart, however, the chain will more likely fall into state 3 than state 1 (5/9 probability vs. 4/9), so for ε ≫ δ the stationary distribution will be noticeably perturbed by the restarts. If ε ≪ δ, then there will be enough time for the chain to mix between restarts, so this bias will vanish. See Figure 1 for an illustration of this phenomenon.

3. Theoretical Properties

Markov chains that can be expressed according to (2) are said to have a strong Doeblin parameter ε (Doeblin, 1940). In this section, we characterize the stationary distribution and mixing time of Ã, and also relate the stationary distribution of Ã to that of A as a function of ε. Often the easiest way to study the mixing time of Ã is via its spectral gap, which is defined as $1 - \lambda_2(\tilde{A})$, where $\lambda_2(\tilde{A})$ is the second-largest eigenvalue (in complex norm). A standard result for Markov chains is that, under mild assumptions, the mixing time of Ã is $O\!\left(\frac{1}{1-\lambda_2(\tilde{A})}\right)$. We assume throughout this section that A is ergodic but not necessarily reversible. See Section 12.4 of Levin et al. (2008) for more details.

Our first result relates the spectral gap (and hence the mixing time) to ε. This result (as well as the next) is well-known, but we include both for completeness. For most results in this section, we sketch the proof here and provide the full details in the supplement.

Proposition 3.1. The spectral gap of Ã is at least ε; that is, $1 - \lambda_2(\tilde{A}) \ge \epsilon$. In particular, Ã mixes in time O(1/ε).

The key idea is that all eigenvectors of A and Ã are equal except for the stationary distribution, and that $\lambda_k(\tilde{A}) = (1-\epsilon)\lambda_k(A)$ for k > 1.

Having established that Ã mixes quickly, the next step is to characterize its stationary distribution:

Proposition 3.2. Let π̃ be the stationary distribution of Ã. Then

$\tilde{\pi} = \epsilon \sum_{j=0}^{\infty} (1-\epsilon)^j A^j u = \epsilon \left(I - (1-\epsilon)A\right)^{-1} u.$   (3)

This can be directly verified algebraically. The summation over j shows that we can in fact draw an exact sample from π̃ by drawing T ∼ Geometric(ε),¹ initializing from u, and transitioning T times according to A. This is intuitive, since at a generic point in time we expect the most recent sample from u to have occurred Geometric(ε) steps ago.
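A minimal sketch of this exact-sampling procedure, checked against the closed form in Eq. (3); the small 4-state chain and restart distribution below are made up for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 4, 0.1
A = rng.random((n, n)); A /= A.sum(axis=0)   # a made-up column-stochastic kernel A
u = rng.random(n); u /= u.sum()              # a made-up restart distribution u

# Closed form from Eq. (3): pi_tilde = eps * (I - (1 - eps) A)^{-1} u
pi_closed = eps * np.linalg.solve(np.eye(n) - (1 - eps) * A, u)

def exact_sample():
    """Exact sample from pi_tilde: restart from u, then T ~ Geometric(eps) steps of A."""
    T = rng.geometric(eps) - 1               # P[T = j] = eps * (1 - eps)^j for j >= 0
    y = rng.choice(n, p=u)
    for _ in range(T):
        y = rng.choice(n, p=A[:, y])
    return y

draws = np.array([exact_sample() for _ in range(20000)])
pi_mc = np.bincount(draws, minlength=n) / len(draws)
print(np.round(pi_closed, 3), np.round(pi_mc, 3))   # the two estimates should closely agree
```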
Note that E[T + 1] = 1/ε, which is consistent with the fact that the mixing time is O(1/ε) (Proposition 3.1).

We would like to relate the stationary distributions π and π̃ of A and Ã, respectively. The next two results (which are new) do so. Let $\tilde{\pi}_\epsilon$ denote the stationary distribution of Ã at a particular value of ε; note that $\tilde{\pi}_1 = u$ and $\tilde{\pi}_0 = \pi$. We will show that $\tilde{\pi}_\epsilon$ approaches π monotonically, for both directions of the KL divergence. In particular, for any ε < 1, $\tilde{\pi}_\epsilon$ is at least as good as u at approximating π. To show this, we make use of the following lemma from Murray & Salakhutdinov (2008):

Lemma 3.3. If B is a transition matrix with stationary distribution π, then $\mathrm{KL}(\pi \,\|\, B\pi') \le \mathrm{KL}(\pi \,\|\, \pi')$ and $\mathrm{KL}(B\pi' \,\|\, \pi) \le \mathrm{KL}(\pi' \,\|\, \pi)$.

¹ If T ∼ Geometric(ε), we have $P[T = j] = \epsilon(1-\epsilon)^j$ for j ≥ 0.

Figure 2. Markov chains over (a) two stages (strong Doeblin chains); and (b) three stages (restart from u followed by transitions from A1 and then from A2).

Using this lemma, we can prove the following monotonicity result:

Proposition 3.4. Both $\mathrm{KL}(\tilde{\pi}_\epsilon \,\|\, \pi)$ and $\mathrm{KL}(\pi \,\|\, \tilde{\pi}_\epsilon)$ are monotonic functions of ε.

The idea is to construct a transition matrix B that maps $\tilde{\pi}_{\epsilon_1}$ to $\tilde{\pi}_{\epsilon_2}$ for given $\epsilon_2 < \epsilon_1$, then show that its stationary distribution is π and apply Lemma 3.3.

With Proposition 3.4 in hand, a natural next question is how small ε must be before π̃ is reasonably close to π. Proposition 3.5 provides one such bound: π̃ is close to π if ε is small compared to the spectral gap $1 - \lambda_2(A)$.

Proposition 3.5. Suppose that A satisfies detailed balance with respect to π. Let π̃ be the stationary distribution of Ã. Define $d_\pi(\pi') \overset{\mathrm{def}}{=} \|\pi - \pi'\|_{\mathrm{diag}(\pi)^{-1}}$, where $\|v\|_M$ is the Mahalanobis distance $\sqrt{v^\top M v}$. Then $d_\pi(\tilde{\pi}) \le \frac{\epsilon}{1-\lambda_2(A)}\, d_\pi(u)$. (In particular, $d_\pi(\tilde{\pi}) \ll 1$ if $\epsilon \ll (1-\lambda_2(A))/d_\pi(u)$.)

The proof is somewhat involved, though based on classical arguments (for instance, Chapter 12 of Levin et al. (2008)). The key step is to establish that $d_\pi(\pi')$ is convex in π′ and contractive with respect to A (more precisely, that $d_\pi(A\pi') \le \lambda_2(A)\, d_\pi(\pi')$).

Proposition 3.5 says that if A mixes quickly (in particular, in time much smaller than 1/ε), then A and Ã will have similar stationary distributions. This serves as a sanity check: if A already mixes quickly, then π̃ is a good approximation to π; otherwise, the Doeblin construction ensures that we are at least converging to some distribution, which by Proposition 3.4 approximates π at least as well as u does.

3.1. Staged strong Doeblin chains

Recall that to run a strong Doeblin chain Ã, we first sample from u, and then transition according to A for approximately 1/ε steps. The intuition is that sampling from the crude distribution u facilitates global exploration of the state space, while the refined transition A hones in on a mode. However, for complex problems, there might be a considerable gap between what is possible with exact inference (u) and what is needed for accurate modeling (A). This motivates using multiple stages of MCMC to bridge the gap.

To do this, we introduce an auxiliary variable z ∈ Z denoting which stage of MCMC we are currently in. For each stage z, we have a Markov chain $A_z(y_t \mid y_{t-1})$ over the original state space. We also define a Markov chain $C(z_t \mid z_{t-1})$ over the stages. To transition from $(y_{t-1}, z_{t-1})$ to $(y_t, z_t)$, we first sample $z_t$ from $C(z_t \mid z_{t-1})$ and then $y_t$ from $A_{z_t}(y_t \mid y_{t-1})$. If there is a special state z* for which $A_{z^*}(y_t \mid y_{t-1}) = u(y_t)$ (i.e., $A_{z^*}$ does not depend on $y_{t-1}$), then we call the resulting chain a staged strong Doeblin chain.
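A minimal sketch of one joint transition (y, z) → (y′, z′) of such a chain; the stage-transition distribution and per-stage kernels are passed in as black-box callables, and the argument names are illustrative assumptions rather than the paper's own code.

```python
import random

def staged_step(y_prev, z_prev, C, stage_kernels):
    """One joint transition of a staged strong Doeblin chain.

    C[z_prev] is a dict {next_stage: probability}; stage_kernels[z] is a sampler
    mapping y_prev to a new y.  For the special stage z*, stage_kernels[z*]
    ignores y_prev and draws a fresh sample from the restart distribution u.
    """
    stages, probs = zip(*C[z_prev].items())
    z_next = random.choices(stages, weights=probs)[0]   # z_t ~ C(. | z_{t-1})
    y_next = stage_kernels[z_next](y_prev)              # y_t ~ A_{z_t}(. | y_{t-1})
    return y_next, z_next
```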
For example, if z ∈ {0, 1} and we transition from 0 to 1 with probability 1−ε and from 1 to 0 with probability ε, then we recover strong Doeblin chains, assuming z* = 0 (Figure 2(a)). As another example (Figure 2(b)), let z ∈ {0, 1, 2}. When z = z* = 0, we transition according to the restart distribution u; when z = 1, we transition according to a simple chain A1; and when z = 2, we transition according to a more complex chain A2. If we transition from z = 0 to z = 1 with probability 1, from z = 1 to z = 2 with probability ε1, and from z = 2 to z = 0 with probability ε2, then we will on average draw 1 sample from u, 1/ε1 samples from A1, and 1/ε2 samples from A2.

We now show that staged strong Doeblin chains mix quickly as long as we visit z* reasonably often. In particular, the following theorem provides guarantees on the mixing time that depend only on z* and on the structure of $C(z_t \mid z_{t-1})$, analogous to the dependence only on ε for non-staged chains. The condition of the theorem asks for times a and b such that the first time after a that we hit z* is almost independent of the starting state z0, and is less than b with high probability.

Theorem 3.6. Let M be a staged strong Doeblin chain on $\mathcal{Z} \times \mathcal{Y}$. Let $\tau_a$ be the earliest time s ≥ a for which $z_s = z^*$. Let $\beta_{a,s} = \min_{z \in \mathcal{Z}} P[\tau_a = s \mid z_0 = z]$ and $\gamma_{a,b} \overset{\mathrm{def}}{=} \sum_{s=a}^{b} \beta_{a,s}$. Then $M^b$ has strong Doeblin parameter $\gamma_{a,b}$, and the spectral gap of M is at least $\frac{\gamma_{a,b}}{b}$. (Setting a = b = 1 recovers Proposition 3.1.)

The key idea is that, conditioned on $\tau_a$, $(y_b, z_b)$ is independent of $(y_0, z_0)$ for all $b \ge \tau_a$. For the special case that the stages form a cycle as in Figure 2, we have the following corollary:

Corollary 3.7. Let C be a transition matrix on {0, . . . , k−1} such that $C(z_t = i \mid z_{t-1} = i) = 1 - \delta_i$ and $C(z_t = (i+1) \bmod k \mid z_{t-1} = i) = \delta_i$. Suppose that $\delta_{k-1} \le \frac{1}{\max(2,\, k-1)} \min\{\delta_0, \ldots, \delta_{k-2}\}$. Then the spectral gap of the joint Markov chain is at least $\frac{\delta_{k-1}}{78}$.

The key idea is that, when restricting to the time interval $[2/\delta_{k-1},\ 3/\delta_{k-1}]$, the time of the first transition from k−1 to 0 is approximately Geometric($\delta_{k-1}$)-distributed (independent of the initial state), which allows us to invoke Theorem 3.6. We expect the optimal constant to be much smaller than 78.

4. Learning strong Doeblin chains

Section 3 covered properties of strong Doeblin chains $(1-\epsilon)A_\theta + \epsilon\, u_\theta \mathbf{1}^\top$ for a fixed parameter vector θ. Now we turn to the problem of learning θ from data. We will focus on the discriminative learning setting where we are given a dataset $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ and want to maximize the conditional log-likelihood

$\sum_{i=1}^{n} \log p_\theta(y^{(i)} \mid x^{(i)}),$   (4)

where now $p_\theta$ is the stationary distribution of $\tilde{A}_\theta = (1-\epsilon)A_\theta + \epsilon\, u_\theta \mathbf{1}^\top$. We assume for simplicity that the chains Aθ and uθ are given by conditional exponential families:

$A_\theta(y \mid y', x) \overset{\mathrm{def}}{=} \exp\!\left(\theta^\top \phi(x, y', y) - \log Z(\theta; x, y')\right),$
$u_\theta(y \mid x) \overset{\mathrm{def}}{=} \exp\!\left(\theta^\top \phi(x, y) - \log Z(\theta; x)\right),$   (5)

where each φ outputs a feature vector and the Z are partition functions. By Proposition 3.1, Ãθ mixes quickly for all θ. On the other hand, the parameterization of Aθ captures a rich family of transition kernels, including Gibbs sampling.

At a high level, our learning algorithm performs stochastic gradient descent on the negative log-likelihood. However, the negative log-likelihood is only defined implicitly in terms of the stationary distribution of a Markov chain, so the main challenge is to show that it can be computed efficiently. To start, we assume that we can operate on the base chains uθ and Aθ for one step efficiently:

Assumption 4.1.
We can efficiently sample y from $u_\theta(\cdot \mid x)$ and $A_\theta(\cdot \mid y', x)$, as well as compute $\frac{\partial \log u_\theta(y \mid x)}{\partial \theta}$ and $\frac{\partial \log A_\theta(y \mid y', x)}{\partial \theta}$.

Under Assumption 4.1, we will show how to efficiently compute the gradient of $\log p_\theta(y^{(i)} \mid x^{(i)})$ with respect to θ. The impatient reader may skip ahead to the final pseudocode, which is given in Algorithm 1. For convenience, we will suppress the dependence on x and i and just refer to $p_\theta(y)$ instead of $p_\theta(y^{(i)} \mid x^{(i)})$.

Computing the gradient of $\log p_\theta(y)$ is non-trivial, since the formula for $p_\theta$ is somewhat involved (see Proposition 3.2):

$p_\theta(y) = \epsilon \sum_{j=0}^{\infty} (1-\epsilon)^j \left[A_\theta^j u_\theta\right](y).$   (6)

It is useful to invoke the following generic identity on gradients of conditional log-probabilities, proved in the supplement.

Lemma 4.2. Let z have distribution $p_\theta(z)$ parameterized by a vector θ. Let S be any measurable set. Then

$\frac{\partial \log p_\theta(z \in S)}{\partial \theta} = \mathbb{E}\!\left[\frac{\partial \log p_\theta(z)}{\partial \theta} \,\middle|\, z \in S\right].$

We can utilize Lemma 4.2 by interpreting y | θ as the output of the following generative process, which by Proposition 3.2 yields the stationary distribution of Ãθ:

- Sample $y_0$ from uθ and $y_{t+1} \mid y_t$ from Aθ for t = 0, 1, . . ..
- Sample T ∼ Geometric(ε) and let $y = y_T$.

We then invoke Lemma 4.2 with $z = (T, y_{0:T})$ and S encoding the event that $y_T = y$. As long as we can sample from the posterior distribution of $(T, y_{0:T})$ conditioned on $y_T = y$, we can compute an estimate of $\nabla_\theta \log p_\theta(y)$ as follows:

- Sample $(T, y_{0:T}) \mid y_T = y$.
- Return $\frac{\partial \log p_\theta(T, y_{0:T})}{\partial \theta} = \frac{\partial \log u_\theta(y_0)}{\partial \theta} + \sum_{t=1}^{T} \frac{\partial \log A_\theta(y_t \mid y_{t-1})}{\partial \theta}$.

4.1. Sampling schemes for $(T, y_{0:T})$

By the preceding discussion, it suffices to construct a sampler for $(T, y_{0:T}) \mid y_T = y$. A natural approach is to use importance sampling: sample $(T, y_{0:T-1})$, then weight by $p(y_T = y \mid y_{T-1})$. However, this throws away a lot of work: we make T MCMC transitions but obtain only one sample $(T, y_{0:T})$ with which to estimate the gradient. We would ideally like to make use of all the MCMC transitions when constructing our estimate of $(T, y_{0:T}) \mid y_T = y$. For any t ≤ T, the pair $(t, y_{0:t})$ would itself have been a valid sample under different randomness, and we would like to exploit this.

Suppose that we sample T from some distribution F and let q(t) be the probability that T ≥ t under F. Then we can use the following scheme: Sample T from F, then sample $y_{0:T-1}$. For t = 0, . . . , T, weight $(t, y_{0:t-1})$ by $\frac{\epsilon(1-\epsilon)^t}{q(t)}\, p(y_t = y \mid y_{t-1})$.

For any q, this yields unbiased (although unnormalized) weights (see Section B in the supplement). Typically we will choose $q(t) = (1-\epsilon)^t$, e.g., F is a geometric distribution. If the $y_t$ are perfectly correlated, this will not be any more effective than vanilla importance sampling, but in practice this method should perform substantially better. Even though we obtain weights on all of $y_{0:T}$, these weights will typically be highly correlated, so we should still repeat the sampling procedure multiple times to minimize the bias from estimating the normalization constant. The full procedure is given as pseudocode in Algorithm 1.

Algorithm 1. Algorithm for computing an estimate of $\nabla_\theta \log p_\theta(y \mid x)$. This estimate is asymptotically unbiased as the number of samples k → ∞, but will be biased for a finite number of samples due to variance in the estimate of the normalization constant.
SampleGradient(x, y, θ, ε, k)   [k is the number of samples to take]
  Z ← 0; g ← 0   [Z is the total importance mass of all samples; g/Z is the gradient]
  for i = 1 to k do
    Sample T ∼ Geometric(ε)
    Sample $y_0$ from $u_\theta(\cdot \mid x)$
    For 1 ≤ t ≤ T−1: sample $y_t$ from $A_\theta(\cdot \mid y_{t-1}, x)$
    $w_0 \leftarrow \epsilon\, u_\theta(y \mid x)$
    For 1 ≤ t ≤ T: $w_t \leftarrow \epsilon\, A_\theta(y \mid y_{t-1}, x)$
    $Z \leftarrow Z + \sum_{t=0}^{T} w_t$
    $g \leftarrow g + w_0 \frac{\partial \log u_\theta(y \mid x)}{\partial \theta} + \sum_{t=1}^{T} w_t \left[ \frac{\partial \log u_\theta(y_0 \mid x)}{\partial \theta} + \sum_{s=1}^{t-1} \frac{\partial \log A_\theta(y_s \mid y_{s-1}, x)}{\partial \theta} + \frac{\partial \log A_\theta(y \mid y_{t-1}, x)}{\partial \theta} \right]$
  end for
  Output g/Z

4.2. Implementation

With the theory above in place, we now describe some important implementation details of our learning algorithm. At a high level, we can just use Algorithm 1 to compute estimates of the gradient and then apply an online learning algorithm such as AdaGrad (Duchi et al., 2010) to identify a good choice of θ. Since the log-likelihood is a non-convex function of θ, the initialization is important. We make the following (weak) assumption:

Assumption 4.3. The chains uθ and Aθ are controlled by disjoint coordinates of θ, and for any setting of uθ there is a corresponding choice of Aθ that leaves uθ invariant (i.e., $A_\theta u_\theta = u_\theta$).

In practice, Assumption 4.3 is easy to satisfy. For instance, suppose that $\phi : \mathcal{Y} \to \mathbb{R}^d$ is a feature function, $\theta = [\theta_0\ \theta_1] \in \mathbb{R}^{d_0 + d}$ are the features controlling u and A, and $u_{\theta_0}$ is made tractable by zeroing some features out: $u_{\theta_0}(y) \propto \exp([\theta_0\ 0_{d-d_0}]^\top \phi(y))$. Also suppose that $A_{\theta_1}$ is a Gibbs sampler that uses all the features: $A_{\theta_1}(y \mid y') \propto \exp(\theta_1^\top \phi(y_i, y'_{-i}))$, where i is a randomly chosen coordinate of y. Then, we can satisfy Assumption 4.3 by setting $\theta_1 = [\theta_0\ 0_{d-d_0}]$.

Under Assumption 4.3, we can initialize θ by first training u in isolation (which is a convex problem since uθ parameterizes an exponential family), then initializing A to leave u invariant; this guarantees that the initial log-likelihood is what we would have obtained by just using u by itself. We found this to work well empirically.

As another note, Algorithm 1 naïvely looks like it takes $O(T^2)$ time to compute the gradient for each sample, due to the nested sum. However, most terms are of the form $w_t \frac{\partial \log A_\theta(y_s \mid y_{s-1}, x)}{\partial \theta}$; by grouping them for a fixed s, we can compute the sum in O(T) time, leading to an expected runtime of O(k/ε) for Algorithm 1 (since E[T + 1] = 1/ε).
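To make this concrete, here is a minimal Python sketch of the estimator in Algorithm 1, including the O(T) grouping of terms just described. The samplers and score functions for uθ and Aθ (`sample_u`, `sample_A`, `prob_u`, `prob_A`, `grad_log_u`, `grad_log_A`) are assumed black boxes; their names and signatures are illustrative, not from the paper.

```python
import numpy as np

def sample_gradient(x, y, theta, eps, k, rng,
                    sample_u, sample_A,        # draw from u_theta(.|x) and A_theta(.|y_prev, x)
                    prob_u, prob_A,            # evaluate u_theta(y|x) and A_theta(y|y_prev, x)
                    grad_log_u, grad_log_A):   # score functions returning vectors shaped like theta
    """Estimate grad_theta log p_theta(y|x) as in Algorithm 1, with the O(T) grouping trick."""
    Z, g = 0.0, np.zeros_like(theta)
    for _ in range(k):
        T = rng.geometric(eps) - 1                     # T ~ Geometric(eps), P[T=j] = eps(1-eps)^j
        ys = [sample_u(x, theta)]                      # y_0, ..., y_{T-1}
        for _ in range(T - 1):
            ys.append(sample_A(ys[-1], x, theta))
        w = [eps * prob_u(y, x, theta)]                # w_0
        w += [eps * prob_A(y, ys[t - 1], x, theta) for t in range(1, T + 1)]
        S = np.cumsum(w[::-1])[::-1]                   # suffix sums: S[t] = w_t + ... + w_T
        Z += S[0]
        g = g + w[0] * grad_log_u(y, x, theta)
        if T >= 1:
            g = g + S[1] * grad_log_u(ys[0], x, theta)
            for s in range(1, T):                      # grouped prefix terms, O(T) total
                g = g + S[s + 1] * grad_log_A(ys[s], ys[s - 1], x, theta)
            for t in range(1, T + 1):                  # terms where the last step lands on y
                g = g + w[t] * grad_log_A(y, ys[t - 1], x, theta)
    return g / Z                                       # self-normalized importance estimate
```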
5. Experiments

We validated our method on two non-trivial inference tasks. These tasks are difficult due to the importance of high-arity factors; local information is insufficient to even identify high-probability regions of the space.

Inferring Words from Keyboard Gestures. We first considered the task of inferring words from keyboard gestures. We generated the data by sampling words from the New York Times corpus (Sandhaus, 2008). For each word, we used a time series model to synthetically generate finger gestures for the word. A typical instantiation of this process is given in Figure 3. The learning task is to discriminatively infer the intended word y given the sequence of keys x that the finger was over (for instance, predicting banana from bdsadbnnnfaassjjj). In our model, we posit a latent alignment z between key presses and intended letters.

Figure 3. Generated sequence of keyboard gestures for the word banana. The input x is a sequence of characters (the recorded key presses), and the output y is the intended word. Most characters in x are incidental and do not correspond to any character in y; this is reflected by the (unobserved) alignment z.

Given an input x of length l, the alignment z also has length l; each $z_i$ is either c ($x_i$ starts an output letter c), -c ($x_i$ continues an output letter c), or # ($x_i$ is unaligned); see Figure 3 for an example. Note that y is a deterministic function of z. The base model uθ consists of indicator features on $(x_i, z_i)$, $(x_i, z_{i-1}, z_i)$, and $(x_i, x_{i-1}, z_i)$. The full Aθ is a Gibbs sampler in a model where we include the following features in addition to those above:

- Indicator features on $(x_i, y_i, y_{i-1})$
- Indicator of y being in the dictionary, as well as log of word frequency (conditioned on being in the dictionary)
- For each i, indicator of $y_{1:i}$ matching a prefix of a word in the dictionary

We compared three approaches:

- Our approach (Doeblin sampling)
- Regular Gibbs sampling, initialized by setting $z_i = x_i$ for all i (basic-Gibbs)
- Gibbs sampling initialized from uθ (uθ-Gibbs)

At test time, all three of these methods are computationally almost identical: they all initialize from some distribution, then draw a certain number of Gibbs samples. For basic-Gibbs and uθ-Gibbs, this is always a fixed number of steps T, while for Doeblin sampling, the number of steps is geometrically distributed with mean T. The main difference is in how the methods are trained. Our method is trained using the ideas in Section 4; for the other two methods, we train by approximating the gradient

$\nabla_\theta \log p_\theta(y \mid x) = \mathbb{E}_{\hat{z} \sim p_\theta(z \mid x, y)}[\phi(y, \hat{z}, x)] - \mathbb{E}_{\hat{y}, \hat{z} \sim p_\theta(y, z \mid x)}[\phi(\hat{y}, \hat{z}, x)],$

where φ(y, z, x) is the feature function and pθ is the stationary distribution of Aθ. For the second term, we use MCMC samples from Aθ to approximate $p_\theta(y, z \mid x)$. For the first term, we could take the subset of samples where ŷ = y, but this is problematic if no such samples exist. Instead, we reweight all samples with ŷ ≠ y by exp(−(D+1)), where D is the edit distance between y and ŷ. We use the same reweighting approach for the Doeblin sampler, using this as the importance weight rather than using $A_\theta(y \mid y_{t-1})$ as in Algorithm 1.

To provide a fair comparison of the methods, we set ε in the Doeblin sampler to the inverse of the number of transitions T, so that the expected number of transitions of all algorithms is the same. We also devoted the first half of each chain to burn-in. All algorithms are trained with AdaGrad (Duchi et al., 2010) with 16 independent chains for each example. We measure word-level accuracy by computing the fraction of (non-burn-in) samples whose output y is correct. The results are reported in Figure 4.

Overall, our Doeblin sampler outperforms uθ-Gibbs by a significant margin, which in turn outperforms basic-Gibbs. Interestingly, while the accuracy of our method continues to improve with more training time, uθ-Gibbs quickly asymptotes and then slightly decreases, even for training accuracy. What is happening to uθ-Gibbs? Since the inference problem in this task is hard, the samples provide a poor gradient approximation. As a result, optimization methods that take the approximation at face value may not converge to even a local optimum.
Figure 4. Plots of word-level (left) and character-level (right) accuracy. The first panel gives the performance of all 3 methods (Doeblin sampling, uθ-Gibbs, and basic-Gibbs) for a computational budget of 20 transitions per example. The second and third panels show the accuracy of Doeblin sampling and uθ-Gibbs, respectively, for increasing computational budgets (20, 50, and 100 transitions).

This phenomenon has already been studied in other contexts, for instance by Kulesza & Pereira (2007) and Huang et al. (2012). In contrast, our method directly optimizes the log-likelihood of the data under the distribution $\tilde{\pi}_\theta$, so that accuracy continues to increase with more passes through the training data. This demonstrates that the MCMC samples do provide enough signal to train from, but that naïvely plugging them into a method designed for exact inference will fail to exploit that signal.

Inferring DNF Formulas. Next, we study the use of our staged Doeblin chain construction as a tool for hierarchical initialization. We ignore learning for now, instead treating MCMC as a stochastic search algorithm. Our task of interest is to infer a DNF formula f from its input-output behavior. This is an important subroutine in loop invariant synthesis, where MCMC methods have recently shown great promise (Gulwani & Jojic, 2007; Sharma & Aiken, 2014).

Concretely, we are given the output of an unknown DNF formula f for various inputs x = (x1, x2, x3); for instance:

f(1, 2, 3) = True
f(1, 4, 4) = True
f(0, 1, 0) = False
f(0, 2, 2) = True

Our task is to induce f; in this case, $f(x_1, x_2, x_3) = [x_1 \neq 0] \vee [x_2 = x_3]$. In general, we avoid trivial solutions that overfit the data by imposing limits on the size of f.

More formally, we consider DNF formulae with linear inequalities: $f(x) = \bigvee_{i=1}^{n} \bigwedge_{j=1}^{m} [a_{ij}^\top x \le b_{ij}]$, where $a_{ij}, x \in \mathbb{Z}^d$ and $b_{ij} \in \mathbb{Z}$. The formula f maps input vectors x to {True, False}. Given a collection of example inputs and outputs, our goal is to find an f consistent with all examples. Our evaluation metric is the time to find such a formula.

The search space for this problem is extremely large. Even if we set n = m = 3 and restrict our search to $a_{ij} \in \{-1, 0, 1\}^5$ and b ∈ {−1, 0, 1}, the total number of candidate formulae is still $3^{6 \cdot 3 \cdot 3} \approx 5.8 \times 10^{25}$.

We consider three MCMC methods: no restarts (0-stage), uniformly random restarts (1-stage), and a staged method (2-stage) as in Section 3.1. All base chains perform Metropolis-Hastings using proposals that edit individual atoms (e.g., $[a_{ij}^\top x \le b_{ij}]$), either by changing a single entry of $[a_{ij}\ b_{ij}]$ or by changing all entries of $[a_{ij}\ b_{ij}]$ at once. For the staged method, we initialize f uniformly at random, take Geometric(0.04) transitions based on a simplified cost function, then take Geometric(0.0002) steps with the full cost (this is the staged Doeblin chain in Figure 2). The full cost function is I(f), the number of examples f errs on. We stop the Markov chain when it finds a formula with I(f) = 0.

The simplified cost function decomposes over the disjuncts: for each disjunct d(x), if f(x) = False while d(x) = True, we incur a large cost (since in order for f(x) to be false, all disjuncts comprising f(x) must also be false). If f(x) = True while d(x) = False, then we incur a smaller cost. If f(x) = d(x), then we incur no cost.
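A minimal sketch of the formula representation and the two cost functions, to make the search objective concrete. Here an example is a pair (x, label), with x a length-d integer vector and label the observed True/False output; the specific penalty values in the simplified cost are illustrative assumptions (the paper only states that a True disjunct on a False example is penalized more heavily).

```python
import numpy as np

def dnf_eval(a, b, x):
    """Evaluate f(x) = OR_i AND_j [a[i,j]^T x <= b[i,j]]; a has shape (n, m, d), b has shape (n, m)."""
    return bool(np.any(np.all(a @ x <= b, axis=1)))

def full_cost(a, b, examples):
    """I(f): the number of labeled examples (x, label) that the formula gets wrong."""
    return sum(dnf_eval(a, b, x) != label for x, label in examples)

def simplified_cost(a, b, examples, big=10.0, small=1.0):
    """Per-disjunct relaxation of I(f).  A disjunct that is True on a False-labeled example
    is a large error (all disjuncts must be False there); a disjunct that is False on a
    True-labeled example is a small one (only some disjunct must be True).  The penalty
    values `big` and `small` are assumptions for illustration."""
    cost = 0.0
    for x, label in examples:
        disjuncts = np.all(a @ x <= b, axis=1)          # truth value of each disjunct at x
        cost += big * np.sum(disjuncts & (not label))
        cost += small * np.sum(~disjuncts & label)
    return cost
```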
We used all three methods as a subroutine in verifying properties of C programs; each such verification requires solving many instances of DNF formula inference. Using the staged method, we are able to obtain a 30% speedup over uniformly random restarts and a 50x improvement over no restarts, as shown in Table 1.

Table 1. Comparison of 3 different MCMC algorithms. 0-stage uses no restarts, 1-stage uses random restarts, and 2-stage uses random restarts followed by a short period of MH with a simplified cost function. The table gives mean time and standard error (in seconds) taken to verify 5 different C programs, averaged over 1000 trials. Each verification requires inferring many DNF formulae as a subroutine.

Task      fig1             cegar2           nested          tacas06         hard
0-stage   2.6 ± 1.0        320 ± 9.3        120 ± 7.0       ≥ 600           ≥ 600
1-stage   0.074 ± 0.001    0.41 ± 0.01      2.4 ± 0.10      6.8 ± 0.15      52 ± 1.5
2-stage   0.055 ± 0.005    0.33 ± 0.007     2.3 ± 0.12      4.6 ± 0.12      31 ± 0.90

6. Discussion

We have proposed a model family based on strong Doeblin Markov chains, which guarantee fast mixing. Our construction allows us to simultaneously leverage a simple, tractable model (uθ) that provides coverage together with a complex, accurate model (Aθ) that provides precision. As such, we sidestep a typical dilemma: whether to use a simple model with exact inference, or to deal with the consequences of approximate inference in a more complex model.

While our approach works well in practice, there are still some outstanding issues. One is the non-convexity of the learning objective, which makes the procedure dependent on initialization. Another issue is that the gradients returned by Algorithm 1 can be large, heterogeneous, and high-variance. The adaptive nature of AdaGrad alleviates this somewhat, but ideally we would like a sampling procedure that has lower variance than Algorithm 1.

Though Gibbs sampling is the de facto method for many practitioners, there are also many more sophisticated approaches to MCMC (Green, 1995; Earl & Deem, 2005). Since our framework is orthogonal to the particular choice of transition kernel, it would be interesting to apply our method in these contexts.

Finally, we would like to further explore the staged construction from Section 3.1. As the initial results on DNF formula synthesis are encouraging, it would be interesting to apply the construction to high-dimensional feature spaces as well as rich, multi-level hierarchies. We believe this might be a promising approach for extremely rich models in which a single level of re-initialization is insufficient to capture the complexity of the cost landscape.

Related Work. Our learning algorithm is reminiscent of policy gradient algorithms in reinforcement learning (Sutton et al., 1999), as well as Searn, which tries to learn an optimal search policy for structured prediction (Daumé et al., 2009). Shi et al. (2015) applied reinforcement learning to choose which variable to update next in a Gibbs sampler. Our staged construction is also similar in spirit to path sampling (Gelman & Meng, 1998), which uses a multi-stage approach to smoothly transition from a very simple to a very complex distribution. Our staged Doeblin construction belongs to the family of coarse-to-fine inference methods, which operate on progressively more complex models (Viola & Jones, 2004; Shen et al., 2004; Collins & Koo, 2005; Gu et al., 2009; Weiss et al., 2010; Sapp et al., 2010; Petrov & Klein, 2007; Yadollahpour et al., 2013).
On the theoretical front, we make use of the well-developed theory of strong Doeblin chains, often also referred to with the terms minorization or regeneration time (Doeblin, 1940; Roberts & Tweedie, 1999; Meyn & Tweedie, 1994; Athreya & Ney, 1978). The strong Doeblin property is typically used to study convergence of continuous-space Markov chains, but Rosenthal (1995) has used it to analyze Gibbs sampling, and several authors have provided algorithms for sampling exactly from arbitrary strong Doeblin chains (Propp & Wilson, 1996; Corcoran & Tweedie, 1998; Murdoch & Green, 1998). We are the first to use strong Doeblin properties to construct model families and learn them from data.

At a high level, our idea is to identify a family of models for which an approximate inference algorithm is known to work well, thereby constructing a computationally tractable model family that is nevertheless more expressive than typical tractable families such as low-treewidth graphical models. We believe this general research program is very interesting, and could be applied to other inference algorithms as well, thus solidifying the link between statistical theory and practical reality.

Acknowledgments. The first author was supported by a Fannie & John Hertz Fellowship as well as an NSF Graduate Fellowship. The second author was supported by a Microsoft Research Faculty Fellowship. We also thank the anonymous referees for their helpful comments.

Reproducibility. Code, data, and experiments for this paper are available on the CodaLab platform at https://www.codalab.org/worksheets/0xc6edf0c9bec643ac9e74418bd6ad4136/.

References

Athreya, K. B. and Ney, P. A new approach to the limit theory of recurrent Markov chains. Transactions of the American Mathematical Society, 245:493–501, 1978.

Collins, M. and Koo, T. Discriminative reranking for natural language parsing. Computational Linguistics, 31(1):25–70, 2005.

Corcoran, J. and Tweedie, R. Perfect sampling of Harris recurrent Markov chains. Preprint, 1998.

Cowles, M. K. and Carlin, B. P. Markov chain Monte Carlo convergence diagnostics: a comparative review. Journal of the American Statistical Association (JASA), 91(434):883–904, 1996.

Daumé, H., Langford, J., and Marcu, D. Search-based structured prediction. Machine Learning, 75:297–325, 2009.

Doeblin, W. Éléments d'une théorie générale des chaînes simples constantes de Markoff. In Annales scientifiques de l'École Normale Supérieure, volume 57, pp. 61–111, 1940.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. In Conference on Learning Theory (COLT), 2010.

Earl, D. J. and Deem, M. W. Parallel tempering: Theory, applications, and new perspectives. Physical Chemistry Chemical Physics, 7(23):3910–3916, 2005.

Gelman, A. and Meng, X. Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Statistical Science, 13(2):163–185, 1998.

Gelman, A. and Rubin, D. B. A single series from the Gibbs sampler provides a false sense of security. Bayesian Statistics, 4:625–631, 1992.

Green, P. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711–732, 1995.

Gu, C., Lim, J. J., Arbeláez, P., and Malik, J. Recognition using regions. In Computer Vision and Pattern Recognition (CVPR), pp. 1030–1037, 2009.

Gulwani, S. and Jojic, N. Program verification as probabilistic inference. ACM SIGPLAN Notices, 42(1):277–289, 2007.
Huang, L., Fayong, S., and Guo, Y. Structured perceptron with inexact search. In Association for Computational Linguistics (ACL), pp. 142–151, 2012.

Kulesza, A. and Pereira, F. Structured learning with approximate inference. In Advances in Neural Information Processing Systems (NIPS), pp. 785–792, 2007.

Levin, D., Peres, Y., and Wilmer, E. Markov Chains and Mixing Times. American Mathematical Society, 2008.

Meyn, S. and Tweedie, R. Computable bounds for geometric convergence rates of Markov chains. The Annals of Applied Probability, 4(4):981–1011, 1994.

Murdoch, D. J. and Green, P. J. Exact sampling from a continuous state space. Scandinavian Journal of Statistics, 25(3):483–502, 1998.

Murray, I. and Salakhutdinov, R. Notes on the KL-divergence between a Markov chain and its equilibrium distribution. Preprint, 2008.

Petrov, S. and Klein, D. Learning and inference for hierarchically split PCFGs. In Human Language Technology and North American Association for Computational Linguistics (HLT/NAACL), pp. 404–411, 2007.

Propp, J. and Wilson, D. Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Structures and Algorithms, 9:223–252, 1996.

Roberts, G. and Tweedie, R. Bounds on regeneration times and convergence rates for Markov chains. Stochastic Processes and their Applications, 80(2):211–229, 1999.

Rosenthal, J. S. Minorization conditions and convergence rates for Markov chain Monte Carlo. Journal of the American Statistical Association (JASA), 90(430):558–566, 1995.

Sandhaus, E. The New York Times annotated corpus, 2008.

Sapp, B., Toshev, A., and Taskar, B. Cascaded models for articulated pose estimation. In European Conference on Computer Vision (ECCV), pp. 406–420, 2010.

Sharma, R. and Aiken, A. From invariant checking to invariant inference using randomized search. In Computer Aided Verification (CAV), pp. 88–105, 2014.

Shen, L., Sarkar, A., and Och, F. J. Discriminative reranking for machine translation. In North American Association for Computational Linguistics (NAACL), pp. 177–184, 2004.

Shi, T., Steinhardt, J., and Liang, P. Learning where to sample in structured prediction. In Artificial Intelligence and Statistics (AISTATS), pp. 875–884, 2015.

Sutton, R., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS), 1999.

Viola, P. and Jones, M. J. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.

Wainwright, M. Estimating the wrong graphical model: Benefits in the computation-limited setting. Journal of Machine Learning Research (JMLR), 7:1829–1859, 2006.

Weiss, D., Sapp, B., and Taskar, B. Sidestepping intractable inference with structured ensemble cascades. In Advances in Neural Information Processing Systems (NIPS), pp. 2415–2423, 2010.

Yadollahpour, P., Batra, D., and Shakhnarovich, G. Discriminative re-ranking of diverse segmentations. In Computer Vision and Pattern Recognition (CVPR), pp. 1923–1930, 2013.