# Hamiltonian Monte Carlo Without Detailed Balance

Jascha Sohl-Dickstein (JASCHA@STANFORD.EDU), Stanford University, Palo Alto; Khan Academy, Mountain View
Mayur Mudigonda (MUDIGONDA@BERKELEY.EDU), Redwood Center for Theoretical Neuroscience, University of California at Berkeley
Michael R. DeWeese (DEWEESE@BERKELEY.EDU), Redwood Center for Theoretical Neuroscience, University of California at Berkeley

## Abstract

We present a method for performing Hamiltonian Monte Carlo that largely eliminates sample rejection. In situations that would normally lead to rejection, instead a longer trajectory is computed until a new state is reached that can be accepted. This is achieved using Markov chain transitions that satisfy the fixed point equation, but do not satisfy detailed balance. The resulting algorithm significantly suppresses the random walk behavior and wasted function evaluations that are typically the consequence of update rejection. We demonstrate a greater than factor of two improvement in mixing time on three test problems. We release the source code as Python and MATLAB packages.

## 1. Introduction

High dimensional and otherwise computationally expensive probabilistic models are of increasing importance for such diverse tasks as modeling the folding of proteins (Schütte & Fischer, 1999), the structure of natural images (Culpepper et al., 2011), or the activity of networks of neurons (Cadieu & Koepsell, 2010). Sampling from the described distribution is typically the bottleneck when working with these probabilistic models. Sampling is commonly required when training a probabilistic model, when evaluating the model's performance, when performing inference, and when taking expectations (MacKay, 2003). Therefore, work that improves sampling is fundamentally important.

The most common way to guarantee that a sampling algorithm converges to the correct distribution is via a concept known as detailed balance. Sampling algorithms based on detailed balance are powerful because they allow samples from any target distribution to be generated from almost any proposal distribution, using for instance Metropolis-Hastings acceptance criteria (Hastings, 1970). However, detailed balance also suffers from a critical flaw. Precisely because the forward and reverse transitions occur with equal probability, detailed balance driven samplers go backwards exactly as often as they go forwards. The state space is thus explored via a random walk over distances longer than those traversed by a single draw from the proposal distribution. A random walk only travels a distance $dN^{\frac{1}{2}}$ in $N$ steps, where $d$ is the characteristic step length.

The current state-of-the-art sampling algorithm for probability distributions with continuous state spaces is Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal, 2010). By extending the state space to include auxiliary momentum variables, and then using Hamiltonian dynamics to traverse long iso-probability contours in this extended state space, HMC is able to move long distances in state space in a single update step. However, HMC still relies on detailed balance to accept or reject steps, and as a result still behaves like a random walk, just a random walk with a longer step length.
Previous attempts to address this have combined multiple Markov steps that individually satisfy detailed balance into a composite step that does not (Horowitz, 1991), with limited success (Kennedy & Pendleton, 1991). The No-U-Turn Sampler (NUTS) sampling package (Hoffman & Gelman, 2011) and the windowed acceptance method of Neal (1994) both consider Markov transitions within a set of discrete states generated by repeatedly simulating Hamiltonian dynamics. NUTS generates a set of candidate states around the starting state by running Hamiltonian dynamics forwards and backwards until the trajectory doubles back on itself, or a slice variable constraint is violated. It then chooses a new state uniformly at random from the candidate states. In windowed acceptance, a transition is proposed between a window of states at the beginning and end of a trajectory, rather than between only the first state and the last state. Within the selected window, a single state is then chosen using Boltzmann weightings. Both NUTS and the windowed acceptance method rely on detailed balance to choose the candidate state from the discrete set.

Here we present a novel discrete representation of the HMC state space and transitions. Using this representation, we derive a method for performing HMC while abandoning detailed balance altogether, by directly satisfying the fixed point equation restricted to the discrete state space. As a result, random walk behavior in the sampling algorithm is greatly reduced, and the mixing rate of the sampler is substantially improved.

## 2. Sampling

We begin by briefly reviewing some key concepts related to sampling. The goal of a sampling algorithm is to draw characteristic samples $x \in \mathbb{R}^N$ from a target probability distribution $p(x)$. Without loss of generality, we will assume that $p(x)$ is determined by an energy function $E(x)$,

$$p(x) = \frac{1}{Z} \exp\left(-E(x)\right). \tag{1}$$

### 2.1. Markov Chain Monte Carlo

Markov Chain Monte Carlo (MCMC) (Neal, 1993) is commonly used to sample from probabilistic models. In MCMC a chain of samples is generated by repeatedly drawing new samples $x'$ from a conditional probability distribution $T(x'|x)$, where $x$ is the previous sample. Since $T(x'|x)$ is a probability density over $x'$, $\int T(x'|x)\,dx' = 1$ and $T(x'|x) \geq 0$.

### 2.2. Fixed Point Equation

An MCMC algorithm must satisfy two conditions in order to generate samples from the target distribution $p(x)$. The first is mixing, which requires that repeated application of $T(x'|x)$ must eventually explore the full state space of $p(x)$. The second condition is that the target distribution $p(x)$ must be a fixed point of $T(x'|x)$. This second condition can be expressed by the fixed point equation,

$$\int p(x)\,T(x'|x)\,dx = p(x'), \tag{2}$$

which requires that when $T(x'|x)$ acts on $p(x)$, the resulting distribution is unchanged.

### 2.3. Detailed Balance

Detailed balance is the most common way of guaranteeing that the Markov transition distribution $T(x'|x)$ satisfies the fixed point equation (Equation 2). Detailed balance guarantees that if samples are drawn from the equilibrium distribution $p(x)$, then for every pair of states $x$ and $x'$ the probability of transitioning from state $x$ to state $x'$ is identical to that of transitioning from state $x'$ to $x$,

$$p(x)\,T(x'|x) = p(x')\,T(x|x'). \tag{3}$$

By substituting for $T(x'|x)$ on the left side of Equation 2, it can be seen that if Equation 3 is satisfied, then the fixed point equation is also satisfied.
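For concreteness, here is a minimal sketch of such a transition, built with the Metropolis-Hastings rule discussed next. The function names, the Gaussian proposal, and the `step_size` parameter are illustrative choices, not part of the paper or its released package.

```python
import numpy as np

def metropolis_step(x, energy, step_size=0.5, rng=None):
    """One Metropolis-Hastings transition T(x'|x) obeying detailed balance (Eq. 3).

    `energy` is E(x), so p(x) = exp(-E(x))/Z as in Equation 1. The Gaussian
    proposal is symmetric, so the Hastings ratio reduces to p(x')/p(x).
    """
    rng = rng or np.random.default_rng()
    x_prop = x + step_size * rng.standard_normal(x.shape)
    # Accept with probability min(1, p(x')/p(x)) = min(1, exp(E(x) - E(x'))).
    if rng.uniform() < np.exp(min(0.0, energy(x) - energy(x_prop))):
        return x_prop
    return x  # rejection: the chain stays put, producing random walk behavior
```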
An appealing aspect of detailed balance is that a transition distribution satisfying it can be easily constructed from nearly any proposal distribution, using Metropolis-Hastings acceptance/rejection rules (Hastings, 1970). A primary drawback of detailed balance, and of Metropolis-Hastings, is that the resulting Markov chains always engage in random walk behavior, since by definition detailed balance depends on forward and reverse transitions happening with equal probability. The primary advance in this paper is demonstrating how HMC sampling can be performed without resorting to detailed balance.

## 3. Hamiltonian Monte Carlo

Hamiltonian Monte Carlo (HMC) can traverse long distances in state space with single Markov transitions. It does this by extending the state space to include auxiliary momentum variables, and then simulating Hamiltonian dynamics to move long distances along iso-probability contours in the expanded state space.

### 3.1. Extended state space

The state space is extended by the addition of momentum variables $v \in \mathbb{R}^N$, with identity-covariance Gaussian distribution,

$$p(v) = (2\pi)^{-\frac{N}{2}} \exp\left(-\tfrac{1}{2} v^T v\right). \tag{4}$$

We refer to the combined state space of $x$ and $v$ as $\zeta$, such that $\zeta = \{x, v\}$. The corresponding joint distribution is

$$p(\zeta) = p(x, v) = p(x)\,p(v) = \frac{(2\pi)^{-\frac{N}{2}}}{Z} \exp\left(-H(\zeta)\right), \tag{5}$$

$$H(\zeta) = H(x, v) = E(x) + \tfrac{1}{2} v^T v. \tag{6}$$

$H(\zeta)$ has the same form as the total energy in a physical system, where $E(x)$ is the potential energy for position $x$ and $\frac{1}{2}v^Tv$ is the kinetic energy for momentum $v$ (mass is set to one). In HMC, samples from $p(x)$ are generated by drawing samples from the joint distribution $p(x, v)$, and using only the $x$ variables as samples from the desired distribution.

### 3.2. Hamiltonian dynamics

Hamiltonian dynamics govern how physical systems evolve with time. It might be useful to imagine the trajectory of a skateboarder rolling in an empty swimming pool. As she rolls downwards she exchanges potential energy for kinetic energy, and the magnitude of her velocity increases. As she rolls up again she exchanges kinetic energy back for potential energy. In this fashion she is able to traverse long distances across the swimming pool, while at the same time maintaining constant total energy over her entire trajectory.

In HMC, we treat $H(\zeta)$ as the total energy of a physical system, with spatial coordinate $x$, velocity $v$, potential energy $E(x)$, and kinetic energy $\frac{1}{2}v^Tv$. In an identical fashion to the case of the skateboarder in the swimming pool, running Hamiltonian dynamics on this system traverses long distances in $x$ while maintaining constant total energy $H(\zeta)$. By Equation 5, moving along a trajectory with constant energy is identical to moving along a trajectory with constant probability. Hamiltonian dynamics can be run exactly in reverse by reversing the velocity vector. They also preserve volume in $\zeta$. As we will see, all these properties together mean that Hamiltonian dynamics can be used to propose update steps that move long distances in state space while retaining a high acceptance probability.
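The extended-space quantities of Equations 4–6 are simple to express in code. A minimal sketch follows; `energy` is a placeholder for the model's $E(x)$, and these helpers are reused by the later sketches (they are illustrations, not the released package's API):

```python
import numpy as np

def sample_momentum(n_dim, rng):
    """Draw auxiliary momentum v ~ N(0, I) (Equation 4)."""
    return rng.standard_normal(n_dim)

def hamiltonian(x, v, energy):
    """Total energy H(zeta) = E(x) + 0.5 v^T v (Equation 6).

    By Equation 5, p(zeta) is proportional to exp(-H(zeta)), so moving along
    a trajectory of constant H is moving along one of constant probability.
    """
    return energy(x) + 0.5 * v @ v
```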
### 3.3. Operators

The Markov transitions from which HMC is constructed can be understood in terms of several operators acting on $\zeta$. These operators are illustrated in Figure 1a. This representation of the actions performed in HMC, and the corresponding state space, is unique to this paper and diverges from the typical presentation of HMC.

#### 3.3.1. Momentum Flip

The momentum flip operator $F$ reverses the direction of the momentum. It is its own inverse, leaves the total energy unchanged, and preserves volume in state space:

$$F\zeta = F\{x, v\} = \{x, -v\}, \tag{7}$$
$$F^{-1}\zeta = F\zeta, \tag{8}$$
$$H(F\zeta) = H(\zeta), \tag{9}$$
$$\left|\det\left(\frac{\partial F\zeta}{\partial \zeta}\right)\right| = 1. \tag{10}$$

The momentum flip operator $F$ causes movement between the left and right sides of the state ladder in Figure 1b.

Figure 1. (a) The action of operators involved in Hamiltonian Monte Carlo (HMC). The base of each red or green arrow represents the position $x$, and the length and direction of each of these arrows represents the momentum $v$. The flip operator $F$ reverses the momentum. The leapfrog operator $L$ approximately integrates Hamiltonian dynamics. The trajectory taken by $L$ is indicated by the dotted line. The randomization operator $R(\beta)$ corrupts the momentum with an amount of noise that depends on $\beta$. (b) The ladder of discrete states that are accessible by applying $F$ and $L$ starting at state $\zeta$. Horizontal movement on the ladder occurs by flipping the momentum, whereas vertical movement occurs by integrating Hamiltonian dynamics.

#### 3.3.2. Leapfrog Integrator

Leapfrog, or Störmer-Verlet, integration provides a discrete time approximation to Hamiltonian dynamics (Hairer et al., 2003). The operator $L(\epsilon, M)$ performs leapfrog integration for $M$ leapfrog steps with step length $\epsilon$. For conciseness, $L(\epsilon, M)$ will be written only as $L$,

$$L\zeta = \text{the state resulting from } M \text{ steps of leapfrog integration of Hamiltonian dynamics with step length } \epsilon. \tag{11}$$

Like exact Hamiltonian dynamics, leapfrog dynamics are exactly reversible by reversing the velocity vector, and they also exactly preserve volume in state space. $L$ can be inverted by reversing the sign of the momentum, tracing out the reverse trajectory, and then reversing the sign of the momentum again so that it points in the original direction;

$$L^{-1}\zeta = FLF\zeta, \tag{12}$$
$$\left|\det\left(\frac{\partial L\zeta}{\partial \zeta}\right)\right| = 1. \tag{13}$$

Unlike for exact dynamics, the total energy $H(\zeta)$ is only approximately conserved by leapfrog integration, and the energy accumulates errors due to discretization. This discretization error in the energy is the source of all rejections of proposed updates in HMC. The leapfrog operator $L$ causes movement up the right side of the state ladder in Figure 1b, and down the left side of the ladder.

#### 3.3.3. Momentum Randomization

The momentum randomization operator $R(\beta)$ mixes an amount of Gaussian noise determined by $\beta \in [0, 1]$ into the velocity vector,

$$R(\beta)\,\zeta = R(\beta)\,\{x, v\} = \{x, v'\}, \tag{14}$$
$$v' = \sqrt{1 - \beta}\; v + \sqrt{\beta}\; n, \tag{15}$$
$$n \sim \mathcal{N}(0, I). \tag{16}$$

Unlike the previous two operators, the momentum randomization operator is not deterministic. $R(\beta)$ is however a valid Markov transition operator for $p(\zeta)$ on its own, in that it satisfies both Equation 2 and Equation 3. The momentum randomization operator $R(\beta)$ causes movement off of the current state ladder and on to a new state ladder.

### 3.4. Discrete State Space

As illustrated in Figure 1b, the operators $L$ and $F$ generate a discrete state space ladder, with transitions only occurring between $\zeta$ and three other states. Note that every state on the ladder can be represented in many different ways, depending on the series of operators used to reach it. For instance, the state in the upper left of the figure pane can be written $L^{-1}F\zeta = FL\zeta = LFLL\zeta = \cdots$. Standard HMC can be viewed in terms of transitions on this ladder. Additionally, we will see that this discrete state space view allows Equation 2 to be solved directly by replacing the integral over all states with a short sum.
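The three operators of Sections 3.3.1–3.3.3 admit a direct implementation. Below is a minimal sketch; `grad_energy`, a placeholder returning $\nabla E(x)$, and all names and signatures are illustrative assumptions rather than the released package's interface. Equation 15 keeps $p(v)$ invariant because the retained and injected Gaussian components have variances $1-\beta$ and $\beta$.

```python
import numpy as np

def flip(x, v):
    """Momentum flip operator F (Equation 7): self-inverse, energy-preserving."""
    return x, -v

def leapfrog(x, v, grad_energy, epsilon, M):
    """Leapfrog operator L (Equation 11): M Stormer-Verlet steps of length epsilon.

    Volume-preserving and invertible via L^{-1} = FLF (Equations 12-13), but it
    conserves H only approximately, which is the source of all HMC rejections.
    """
    x, v = x.copy(), v.copy()
    v = v - 0.5 * epsilon * grad_energy(x)        # initial half step in momentum
    for m in range(M):
        x = x + epsilon * v                       # full step in position
        step = epsilon if m < M - 1 else 0.5 * epsilon
        v = v - step * grad_energy(x)             # full step, or final half step
    return x, v

def randomize(v, beta, rng):
    """Momentum randomization operator R(beta) (Equations 14-16)."""
    n = rng.standard_normal(v.shape)
    return np.sqrt(1.0 - beta) * v + np.sqrt(beta) * n
```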
### 3.5. Standard HMC

HMC as typically implemented consists of the following steps. Here, $\zeta^{(t,s)}$ represents the state at sampling step $t$ and sampling substep $s$. Each numbered item below corresponds to a valid Markov transition for $p(\zeta)$, satisfying detailed balance. A full sampling step consists of the composition of all three Markov transitions (a code sketch composing them follows the numbered steps).

1. (a) Generate a proposed update,
$$\zeta' = FL\zeta^{(t,0)}. \tag{17}$$
On the state ladder in Figure 1b, this corresponds to moving up one rung ($L$), and then moving from the right to the left side ($F$).
(b) Accept or reject the proposed update using Metropolis-Hastings rules,
$$\pi_{\text{accept}} = \min\left(1, \frac{p(\zeta')}{p\left(\zeta^{(t,0)}\right)}\right), \tag{18}$$
$$\zeta^{(t,1)} = \begin{cases} \zeta' & \text{with probability } \pi_{\text{accept}}, \\ \zeta^{(t,0)} & \text{with probability } 1 - \pi_{\text{accept}}. \end{cases} \tag{19}$$
Note that since the transition $FL$ is its own inverse, the forward and reverse proposal distribution probabilities cancel in the Metropolis-Hastings rule in Equation 18. On rejection, the computations performed in Equation 17 are discarded. In our new technique, this will no longer be true.

2. Flip the momentum,
$$\zeta^{(t,2)} = F\zeta^{(t,1)}. \tag{20}$$
If the proposed update from Step 1 was accepted, then this moves $\zeta^{(t,1)}$ from the left back to the right side of the state ladder in Figure 1b, and prevents the trajectory from doubling back on itself. If the update was rejected, however, and $\zeta^{(t,1)}$ is already on the right side of the ladder, then this causes it to move to the left side of the ladder, and the trajectory to double back on itself. Doubling back on an already computed trajectory is wasteful in HMC, both because it involves recomputing nearly redundant trajectories, and because the distance traveled before the sampler doubles back is the characteristic length scale beyond which HMC explores the state space by diffusion.

3. Corrupt the momentum with noise,
$$\zeta^{(t+1,0)} = R(\beta)\,\zeta^{(t,2)}. \tag{21}$$
It is common to set $\beta = 1$, in which case the momentum is fully randomized every sampling step. In our experiments (Section 5), however, we found that smaller values of $\beta$ produced large improvements in mixing time. This is therefore a hyperparameter that is probably worth adjusting.¹

¹ One method for choosing $\beta$ (Culpepper et al., 2011) which we have found to be effective is to set it such that it randomizes a fixed fraction $\alpha$ of the momentum per unit simulation time,
$$\beta = \alpha^{\frac{1}{\epsilon M}}. \tag{22}$$
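Composing the operator sketches from Section 3.3 gives the full update above. A minimal sketch, reusing the illustrative `hamiltonian`, `leapfrog`, `flip`, and `randomize` helpers defined earlier (again an assumption-laden illustration, not the released package code):

```python
import numpy as np

def hmc_step(x, v, energy, grad_energy, epsilon, M, beta, rng):
    """One standard HMC sampling step: propose FL zeta (Eq. 17), accept or
    reject (Eqs. 18-19), flip the momentum (Eq. 20), then apply R(beta) (Eq. 21)."""
    x_prop, v_prop = flip(*leapfrog(x, v, grad_energy, epsilon, M))  # zeta' = FL zeta
    log_ratio = hamiltonian(x, v, energy) - hamiltonian(x_prop, v_prop, energy)
    if rng.uniform() < np.exp(min(0.0, log_ratio)):  # min(1, p(zeta')/p(zeta))
        x, v = x_prop, v_prop          # accepted: F in Eq. 20 will undo the flip
    x, v = flip(x, v)                  # Eq. 20: on rejection this doubles back
    v = randomize(v, beta, rng)        # Eq. 21
    return x, v
```

Iterating `hmc_step` and keeping only `x` yields samples from $p(x)$; on rejection the momentum flip survives, producing exactly the doubling back that LAHMC, introduced next, is designed to remove.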
## 4. Look Ahead HMC

Here we introduce an HMC algorithm that relies on Markov transitions that do not obey detailed balance, but still satisfy the fixed point equation. This algorithm eliminates much of the momentum flipping that occurs on rejection in HMC, and as a result greatly reduces random walk behavior. It also prevents the trajectory computations that would typically be discarded on proposal rejection from being wasted. We call our algorithm Look Ahead Hamiltonian Monte Carlo (LAHMC).

Figure 2. Autocorrelation vs. number of function evaluations for standard HMC (no momentum randomization, $\beta = 1$), LAHMC with $\beta = 1$, persistent HMC ($\beta = 0.1$), and persistent LAHMC ($\beta = 0.1$) for (a) a two dimensional ill-conditioned Gaussian, (b) a one hundred dimensional ill-conditioned Gaussian, and (c) a two dimensional well conditioned energy function with a rough surface. In all cases the LAHMC sampler demonstrates faster mixing.

### 4.1. Intuition

In LAHMC, in situations that would correspond to a rejection in Step 1 of Section 3.5, we will instead attempt to travel even farther by applying the leapfrog operator $L$ additional times. This section provides intuition for how this update rule was discovered, and how it can be seen to connect to standard HMC. A more mathematically precise description will follow in the next several sections.

LAHMC can be understood in terms of a series of modifications of standard HMC. The net effect of Steps 1 and 2 in Section 3.5 is to transition from state $\zeta$ into either state $L\zeta$ or state $F\zeta$, depending on whether the update in Section 3.5 Step 1 was accepted or rejected. We wish to minimize the transitions into state $F\zeta$. In LAHMC we do this by replacing as many transitions from $\zeta$ to $F\zeta$ as possible with transitions that instead go from $\zeta$ to $L^2\zeta$. This would seem to change the number of transitions into both state $F\zeta$ and state $L^2\zeta$, violating the fixed point equation. However, the changes in incoming transitions from $\zeta$ are exactly counteracted because the state $FL^2\zeta$ is similarly modified, so that it makes fewer transitions into the state $L^2\zeta = F\left(FL^2\zeta\right)$, and more transitions into the state $F\zeta = L^2\left(FL^2\zeta\right)$.

For some states, after this modification there will still be transitions between the states $\zeta$ and $F\zeta$. In order to further minimize these transitions, the process in the preceding paragraph is repeated for these remaining transitions and the state $L^3\zeta$. This process is then repeated again for states $L^4\zeta$, $L^5\zeta$, etc., up to some maximum number of leapfrog applications $K$.

### 4.2. Algorithm

LAHMC consists of the following two steps.

1. Transition to a new state by applying the leapfrog operator $L$ between 1 and $K \in \mathbb{Z}^+$ times, or by applying the momentum flip operator $F$,
$$\zeta^{(t,1)} = \begin{cases} L\zeta^{(t,0)} & \text{with probability } \pi_{L^1}\left(\zeta^{(t,0)}\right), \\ L^2\zeta^{(t,0)} & \text{with probability } \pi_{L^2}\left(\zeta^{(t,0)}\right), \\ \quad\vdots \\ L^K\zeta^{(t,0)} & \text{with probability } \pi_{L^K}\left(\zeta^{(t,0)}\right), \\ F\zeta^{(t,0)} & \text{with probability } \pi_F\left(\zeta^{(t,0)}\right). \end{cases} \tag{23}$$
Note that there is no longer a Metropolis-Hastings accept/reject step. The state update in Equation 23 is a valid Markov transition for $p(\zeta)$ on its own.

2. Corrupt the momentum with noise in an identical fashion as in Equation 21,
$$\zeta^{(t+1,0)} = R(\beta)\,\zeta^{(t,1)}. \tag{24}$$

| Distribution | Sampler | $F\zeta$ | $L\zeta$ | $L^2\zeta$ | $L^3\zeta$ | $L^4\zeta$ |
|---|---|---|---|---|---|---|
| 2d Gaussian | HMC $\beta = 1$ | 0.079 | 0.921 | 0 | 0 | 0 |
| 2d Gaussian | LAHMC $\beta = 1$ | 0.000 | 0.921 | 0.035 | 0.044 | 0.000 |
| 2d Gaussian | HMC $\beta = 0.1$ | 0.080 | 0.920 | 0 | 0 | 0 |
| 2d Gaussian | LAHMC $\beta = 0.1$ | 0.000 | 0.921 | 0.035 | 0.044 | 0.000 |
| 100d Gaussian | HMC $\beta = 1$ | 0.147 | 0.853 | 0 | 0 | 0 |
| 100d Gaussian | LAHMC $\beta = 1$ | 0.047 | 0.852 | 0.059 | 0.035 | 0.006 |
| 100d Gaussian | HMC $\beta = 0.1$ | 0.147 | 0.853 | 0 | 0 | 0 |
| 100d Gaussian | LAHMC $\beta = 0.1$ | 0.047 | 0.852 | 0.059 | 0.035 | 0.006 |
| 2d Rough Well | HMC $\beta = 1$ | 0.446 | 0.554 | 0 | 0 | 0 |
| 2d Rough Well | LAHMC $\beta = 1$ | 0.292 | 0.554 | 0.099 | 0.036 | 0.019 |
| 2d Rough Well | HMC $\beta = 0.1$ | 0.446 | 0.554 | 0 | 0 | 0 |
| 2d Rough Well | LAHMC $\beta = 0.1$ | 0.292 | 0.554 | 0.100 | 0.036 | 0.019 |

Table 1. The fraction of transitions which occurred to each target state for the conditions plotted in Figure 2. Note that LAHMC has far fewer momentum flips than standard HMC.

### 4.3. Transition Probabilities

We choose the probabilities $\pi_{L^a}(\zeta)$ for the leapfrog transitions from state $\zeta$ to state $L^a\zeta$ to be

$$\pi_{L^a}(\zeta) = \min\left[1 - \sum_{b<a} \pi_{L^b}(\zeta),\; \frac{p\left(L^a\zeta\right)}{p(\zeta)}\left(1 - \sum_{b<a} \pi_{L^b}\left(FL^a\zeta\right)\right)\right], \tag{25}$$

with the momentum flip probability absorbing the remainder, $\pi_F(\zeta) = 1 - \sum_a \pi_{L^a}(\zeta)$, so that Equation 23 is normalized.
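Because $\pi_F$ absorbs the leftover probability, Equations 23 and 25 can be realized with a single uniform draw, extending the trajectory only until some $L^a\zeta$ is accepted. The following is a minimal sketch reusing the illustrative `leapfrog` and `randomize` helpers from above; it exploits $p(F\eta) = p(\eta)$, so the reverse-side sums in Equation 25 are the same computation applied to the reversed sequence of trajectory log probabilities. All names and signatures are assumptions for illustration, not the released LAHMC package code.

```python
import numpy as np

def leap_probs(log_p):
    """pi_{L^a}(zeta) of Equation 25 for a = 1 .. len(log_p) - 1, given
    log p(zeta), log p(L zeta), ... along one leapfrog trajectory. Since
    p(F eta) = p(eta), the terms pi_{L^b}(F L^a zeta) equal this same
    function evaluated on the reversed sequence log p(L^a zeta), ..., log p(zeta)."""
    A = len(log_p) - 1
    pi = np.zeros(A)
    for a in range(1, A + 1):
        fwd_leftover = 1.0 - pi[:a - 1].sum()      # 1 - sum_{b<a} pi_{L^b}(zeta)
        rev = leap_probs(log_p[a::-1]) if a > 1 else np.zeros(0)
        rev_leftover = 1.0 - rev[:a - 1].sum()     # 1 - sum_{b<a} pi_{L^b}(F L^a zeta)
        pi[a - 1] = min(fwd_leftover, np.exp(log_p[a] - log_p[0]) * rev_leftover)
    return pi

def lahmc_step(x, v, energy, grad_energy, epsilon, M, K, beta, rng):
    """One LAHMC sampling step (Equations 23-24): try L, L^2, ..., L^K in turn,
    flipping the momentum only with the leftover probability pi_F."""
    log_p = [-(energy(x) + 0.5 * v @ v)]           # log p(zeta) up to a constant
    xa, va = x, v
    u, cum = rng.uniform(), 0.0
    for _ in range(K):
        xa, va = leapfrog(xa, va, grad_energy, epsilon, M)  # extend to L^a zeta
        log_p.append(-(energy(xa) + 0.5 * va @ va))
        cum += leap_probs(np.array(log_p))[-1]              # pi_{L^a}(zeta)
        if u < cum:                                         # transition to L^a zeta
            return xa, randomize(va, beta, rng)
    return x, randomize(-v, beta, rng)                      # momentum flip F zeta
```

Note that the trajectory is only extended while needed, so the leapfrog computations that standard HMC would discard on rejection are reused; tallying which branch each call takes gives transition fractions analogous to Table 1.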