# Magnetic Hamiltonian Monte Carlo

Nilesh Tripuraneni¹, Mark Rowland², Zoubin Ghahramani²³, Richard Turner²

**Abstract.** Hamiltonian Monte Carlo (HMC) exploits Hamiltonian dynamics to construct efficient proposals for Markov chain Monte Carlo (MCMC). In this paper, we present a generalization of HMC which exploits non-canonical Hamiltonian dynamics. We refer to this algorithm as magnetic HMC, since in 3 dimensions a subset of the dynamics map onto the mechanics of a charged particle coupled to a magnetic field. We establish a theoretical basis for the use of non-canonical Hamiltonian dynamics in MCMC, and construct a symplectic, leapfrog-like integrator allowing for the implementation of magnetic HMC. Finally, we exhibit several examples where these non-canonical dynamics can lead to improved mixing of magnetic HMC relative to ordinary HMC.

¹UC Berkeley, USA. ²University of Cambridge, UK. ³Uber AI Labs, USA. Correspondence to: Nilesh Tripuraneni. *Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).*

## 1. Introduction

Probabilistic inference in complex models generally requires the evaluation of intractable, high-dimensional integrals. One powerful and generic approach to inference is to use Markov chain Monte Carlo (MCMC) methods to generate asymptotically exact (but correlated) samples from a posterior distribution for inference and learning. Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal, 2011) is a state-of-the-art MCMC method which uses gradient information from an absolutely continuous target density to encourage efficient sampling and exploration. Crucially, HMC utilizes proposals inspired by Hamiltonian dynamics (corresponding to the classical mechanics of a point particle) which can traverse long distances in parameter space. HMC, and variants like NUTS (which eliminates the need to hand-tune the algorithm's hyperparameters), have been successfully applied to a large class of probabilistic inference problems where they are often the gold standard for (asymptotically) exact inference (Neal, 1996; Hoffman & Gelman, 2014; Carpenter et al., 2016).

*Figure 1. Example sample paths for standard HMC (left) and MHMC (right) for an isotropic Gaussian target distribution.*

In this paper, we first review important properties of Hamiltonian dynamics, namely energy-preservation, symplecticity, and time-reversibility, and derive a more general class of dynamics with these properties which we refer to as non-canonical Hamiltonian dynamics. We then discuss the relationship of non-canonical Hamiltonian dynamics to well-known variants of HMC and propose a novel extension of HMC. We refer to this method as magnetic HMC (see Algorithm 1) since it corresponds to a particular subset of the non-canonical dynamics that in 3 dimensions map onto the mechanics of a charged particle coupled to a magnetic field; see Figure 1 for an example of these dynamics. Furthermore, we construct an explicit, symplectic, leapfrog-like integrator for magnetic HMC which allows for an efficient numerical integration scheme comparable to that of ordinary HMC. Finally, we evaluate the performance of magnetic HMC on several sampling problems where we show how its non-canonical dynamics can lead to improved mixing. The proofs of all results in this paper are presented in the corresponding sections of the Appendix.
## 2. Markov chain Monte Carlo

Given an unnormalized target density $\rho(\theta)$ defined on $\mathbb{R}^d$, an MCMC algorithm constructs an ergodic Markov chain $(\Theta_n)_{n \in \mathbb{N}}$ such that the distribution of $\Theta_n$ converges to $\rho$ (e.g. in total variation) (Robert & Casella, 2004). Often, the transition kernel of such a Markov chain is specified by the Metropolis-Hastings (MH) algorithm, which (i) given the current state $\Theta_n = \theta$, proposes a new state $\tilde\theta$ by sampling from a proposal distribution $Q(\cdot \mid \theta)$, and (ii) sets $\Theta_{n+1} = \tilde\theta$ with probability

$$\min\left(1, \frac{\rho(\tilde\theta)\, Q(\theta \mid \tilde\theta)}{\rho(\theta)\, Q(\tilde\theta \mid \theta)}\right)$$

and $\Theta_{n+1} = \theta$ otherwise. The role of the acceptance step is to enforce reversibility (or detailed balance) of the Markov chain with respect to $\rho$, which implies $\rho$ is a stationary distribution of the transition kernel. Heuristically, a good MH algorithm should have low inter-sample correlation while maintaining a high acceptance ratio. Hamiltonian Monte Carlo provides an elegant mechanism to do this by simulating a particle moving along the contour lines of a dynamical system, constructed from the target density, to use as an MCMC proposal.

### 2.1. Hamiltonian Monte Carlo

In Hamiltonian Monte Carlo, the target distribution is augmented with momentum variables $p$ which are independent of the $\theta$ variables but of equal dimension. For the remainder of the paper, we take the distribution over the momentum variables to be Gaussian, as is common in the literature (indeed, there is evidence that in many cases, the choice of a Gaussian distribution may be optimal (Betancourt, 2017)). The joint target distribution is therefore:

$$\rho(\theta, p) \propto e^{-U(\theta) - p^\top p/2} \equiv e^{-H(\theta, p)}. \tag{1}$$

Crucially, this augmentation allows Hamiltonian dynamics to be used as a proposal for an MCMC algorithm over the space $(\theta, p)$, where we interpret $\theta$ (resp., $p$) as position (resp., momentum) coordinates of a physical particle with total energy $H(\theta, p)$, given by the sum of its potential energy $U(\theta)$ and kinetic energy $p^\top p/2$. We briefly review the Markov chain construction below; see (Neal, 2011) or (Duane et al., 1987) for a more detailed description.

Given the Markov chain state $(\theta_n, p_n)$ at time $n$, the new state for time $n+1$ is obtained by first resampling momentum $p_n \sim N(0, I)$, and then proposing a new state according to the following steps: (i) Simulate the deterministic Hamiltonian flow defined by the differential equation

$$\frac{d}{dt}\begin{pmatrix} \theta(t) \\ p(t) \end{pmatrix} = \begin{pmatrix} \nabla_p H(\theta(t), p(t)) \\ -\nabla_\theta H(\theta(t), p(t)) \end{pmatrix} = \begin{pmatrix} p(t) \\ -\nabla_\theta U(\theta(t)) \end{pmatrix} \tag{2}$$

for time $\tau$, with initial condition $(\theta_n, p_n)$, to obtain $(\theta'_n, p'_n) = \Phi_{\tau,H}(\theta_n, p_n)$;¹ (ii) Flip the resulting momentum component with the map $\Phi_p(\theta, p) = (\theta, -p)$ to obtain $(\tilde\theta_{n+1}, \tilde p_{n+1}) = \Phi_p(\theta'_n, p'_n) = \tilde\Phi_{\tau,H}(\theta_n, p_n)$; (iii) Apply an MH-type accept/reject step to enforce detailed balance with respect to the target distribution; (iv) Flip the momentum again with $\Phi_p$ so it points in the original direction.

¹ Throughout this paper, we use $\Phi_{\tau,H}$ to denote the map that takes a given position-momentum pair as initial conditions for the Hamiltonian flow associated with $H$ for time $\tau$. In addition, $\tilde\Phi_{\tau,H}$ denotes the composition of $\Phi_{\tau,H}$ with the momentum flip map $\Phi_p$.
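To make steps (i)-(iv) concrete, here is a minimal Python sketch of one HMC transition. It is not from the paper: the function name and signature are illustrative, the approximate flow map is passed in as `flow` (e.g. the leapfrog integrator of Section 2.2), and it uses the simplified acceptance probability derived just below.

```python
import numpy as np

def hmc_transition(theta, U, flow, rng):
    """One HMC transition, following steps (i)-(iv) above.

    `flow` approximates the Hamiltonian flow Phi_{tau,H} and is assumed
    volume-preserving and time-reversible, so the Jacobian term in the
    acceptance ratio is 1 and only the energy difference remains.
    """
    H = lambda th, p: U(th) + 0.5 * p @ p        # H = potential + kinetic
    p = rng.standard_normal(theta.shape)         # resample momentum ~ N(0, I)
    theta_new, p_new = flow(theta, p)            # (i) simulate the flow
    p_new = -p_new                               # (ii) flip the momentum
    # (iii) MH accept/reject using the energy difference
    if np.log(rng.uniform()) < H(theta, p) - H(theta_new, p_new):
        theta = theta_new
    # (iv) flipping the momentum back is a no-op for the returned theta,
    # since p is resampled at the start of the next transition anyway
    return theta
```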
Note that because the map $\Phi_{\tau,H}$ is time-reversible (in the sense that if the path $(\theta(t), p(t))$ is a solution to (2), then the path with negated momentum traversed in reverse, $(\theta(-t), -p(-t))$, is also a solution), the map $\tilde\Phi_{\tau,H}$ is self-inverse. From this, the acceptance ratio in step (iii) enforcing detailed balance can be shown (see e.g. (Green, 1995)) to have the form:

$$\min\left(1, \frac{\exp(-H(\tilde\theta_{n+1}, \tilde p_{n+1}))}{\exp(-H(\theta_n, p_n))} \left| \det \nabla_{\theta,p}\, \tilde\Phi_{\tau,H}(\theta_n, p_n) \right| \right). \tag{3}$$

Note that the Hamiltonian flow and momentum flip operator $\tilde\Phi_{\tau,H}$ is volume-preserving², which immediately yields that the Jacobian term in the acceptance ratio (3) is simply 1. The acceptance probability therefore reduces to $\min(1, \exp(H(\theta_n, p_n) - H(\tilde\theta_{n+1}, \tilde p_{n+1})))$. Furthermore, since the Hamiltonian flow defined in (2) is energy-preserving (i.e. $H(\tilde\theta_{n+1}, \tilde p_{n+1}) = H(\theta_n, p_n)$), the acceptance ratio is identically 1. Moreover, the momentum resampling in (i) and momentum flip in (iv) both leave the joint distribution invariant. While the momentum resampling ensures the Markov chain explores the joint $(\theta, p)$ space, the proposals inspired by Hamiltonian dynamics can traverse long distances in parameter space $\theta$, reducing the random-walk behavior of MH that often results in highly correlated samples (Neal, 2011).

² In fact the Hamiltonian flow satisfies the stronger condition of symplecticity with respect to the $A$ matrix, $[\nabla_{\theta,p}\Phi_{\tau,H}(\theta, p)]^\top A^{-1} [\nabla_{\theta,p}\Phi_{\tau,H}(\theta, p)] = A^{-1}$, which immediately implies it is volume-preserving by taking determinants of this relation.

### 2.2. Symplectic Numerical Integration

Unfortunately, it is rarely possible to integrate the flow defined in (2) analytically; instead an efficient numerical integration scheme must be used to generate a proposal for the MH-type accept/reject test. Typically, the leapfrog (Störmer-Verlet) integrator is used since it is an explicit method that is both symplectic and time-reversible (Neal, 2011). One elegant way to motivate this integrator is by decomposing the Hamiltonian into a symmetric splitting:

$$H(\theta, p) = \underbrace{U(\theta)/2}_{H_1(\theta)} + \underbrace{p^\top p/2}_{H_2(p)} + \underbrace{U(\theta)/2}_{H_1(\theta)} \tag{4}$$

and then defining $\Phi_{\epsilon,H_1(\theta)}$ and $\Phi_{\epsilon,H_2(p)}$ to be the exactly-integrated flows for the sub-Hamiltonians $H_1(\theta)$ and $H_2(p)$, respectively. These updates (which are equivalent to Euler translations) can be written:

$$\Phi_{\epsilon,H_1(\theta)}(\theta, p) = \left(\theta,\; p - \epsilon\, \nabla_\theta U(\theta)/2\right), \qquad \Phi_{\epsilon,H_2(p)}(\theta, p) = \left(\theta + \epsilon\, p,\; p\right) \tag{5}$$

since the Hamilton equations (2) for the sub-Hamiltonians $H_1(\theta)$ and $H_2(p)$ are linear, and hence analytically integrable. One leapfrog step is then defined as:

$$\Phi^{\text{frog}}_{\epsilon,H} = \Phi_{\epsilon,H_1(\theta)} \circ \Phi_{\epsilon,H_2(p)} \circ \Phi_{\epsilon,H_1(\theta)} \tag{6}$$

with the overall proposal given by $L$ leapfrog steps, followed by the momentum flip operator $\Phi_p$ as before: $\tilde\Phi^{\text{frog}}_{L,\epsilon,H} = \Phi_p \circ \big(\Phi^{\text{frog}}_{\epsilon,H}\big)^L$.

As each of the flows $\Phi_{\epsilon,H_1(\theta)}$, $\Phi_{\epsilon,H_2(p)}$ exactly integrates a sub-Hamiltonian, they inherit the symplecticity, volume-preservation, and time-reversibility of the exact dynamics. Moreover, since the composition of symplectic flows is also symplectic, and the splitting scheme is symmetric (implying the composition of time-reversible flows is also time-reversible), the Jacobian term in the acceptance probability (3) is exactly 1, as in the case of perfect simulation. The leapfrog scheme will not exactly preserve the Hamiltonian $H$, so the remaining acceptance ratio $\exp(H(\theta_n, p_n) - H(\tilde\theta_{n+1}, \tilde p_{n+1}))$ must be calculated. However, the leapfrog integrator has error $O(\epsilon^3)$ in one leapfrog step (Hairer et al., 2006). This error scaling will lead to good energy conservation properties (and thus high acceptance rates in the MH step), even when simulating over long trajectories.
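As a concrete companion to the splitting (4)-(6), the sketch below (again illustrative, not the paper's code) implements $L$ leapfrog steps; partially applied as `lambda th, p: leapfrog(th, p, grad_U, eps, L)`, it can serve as the `flow` argument of the earlier `hmc_transition` sketch.

```python
import numpy as np

def leapfrog(theta, p, grad_U, eps, L):
    """L leapfrog steps, each the composition (6) of the exact sub-flows (5):
    a half-step of p under H1 = U/2, a full step of theta under
    H2 = p^T p / 2, and another half-step of p under H1."""
    theta, p = theta.astype(float).copy(), p.astype(float).copy()
    for _ in range(L):
        p -= 0.5 * eps * grad_U(theta)   # Phi_{eps,H1}: momentum half-step
        theta += eps * p                 # Phi_{eps,H2}: position full-step
        p -= 0.5 * eps * grad_U(theta)   # Phi_{eps,H1}: momentum half-step
    return theta, p
```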
## 3. Non-Canonical Hamiltonian Monte Carlo

In Section 2, we noted the role that the time-reversibility, volume-preservation, and energy conservation of canonical Hamiltonian dynamics play in making them useful candidates for MCMC. In this section, we develop a general class of flows, which we refer to as non-canonical Hamiltonian systems, with parallel properties, and use them to construct our method, magnetic HMC (see Algorithm 1).

**Lemma 1.** *The map $\Phi^A_{\tau,H}(\theta, p)$ defined by integrating the non-canonical Hamiltonian system*

$$\frac{d}{dt}\begin{pmatrix} \theta(t) \\ p(t) \end{pmatrix} = A\, \nabla_{\theta,p} H(\theta(t), p(t)) \tag{7}$$

*with initial conditions $(\theta, p)$ for time $\tau$, where $A \in M_{2n \times 2n}$ is any invertible, antisymmetric matrix, induces a flow on the coordinates $(\theta, p)$ that is still energy-conserving ($\partial_\tau H(\Phi^A_{\tau,H}(\theta, p)) = 0$) and symplectic with respect to $A$ ($[\nabla_{\theta,p}\Phi^A_{\tau,H}(\theta, p)]^\top A^{-1} [\nabla_{\theta,p}\Phi^A_{\tau,H}(\theta, p)] = A^{-1}$), which also implies volume-preservation of the flow.*

Within the formal construction of classical mechanics, it is known that any Hamiltonian flow defined on the cotangent bundle $(\theta, p)$ of a configuration manifold, which is equipped with an arbitrary symplectic 2-form, will preserve its symplectic structure and admit the corresponding Hamiltonian as a first integral invariant (Arnold, 1989). The statement of Lemma 1 is simply a restatement of this fact grounded in a coordinate system. Similar arbitrary, antisymmetric terms have also appeared in the study of MCMC algorithms based on diffusion processes; such samplers often do not enforce detailed balance with respect to the target density and are often implemented as discretizations of stochastic differential equations (Rey-Bellet & Spiliopoulos, 2015; Ma et al., 2015), in contrast to the approach taken here.

Our second observation is that the dynamics in (7) are not time-reversible in the traditional sense. Instead, if we consider the parametrization of $A$ as

$$A = \begin{pmatrix} E & F \\ -F^\top & G \end{pmatrix}$$

where $E$, $G$ are antisymmetric and $F$ is taken to be general such that $A$ is invertible, then the non-canonical dynamics have a (pseudo) time-reversibility symmetry:

**Lemma 2.** *If $(\theta(t), p(t))$ is a solution to the non-canonical dynamics*

$$\frac{d}{dt}\begin{pmatrix} \theta(t) \\ p(t) \end{pmatrix} = \begin{pmatrix} E & F \\ -F^\top & G \end{pmatrix} \begin{pmatrix} \nabla_\theta H(\theta(t), p(t)) \\ \nabla_p H(\theta(t), p(t)) \end{pmatrix}$$

*then $(\tilde\theta(t), \tilde p(t)) = (\theta(-t), -p(-t))$ is a solution to the modified non-canonical dynamics*

$$\frac{d}{dt}\begin{pmatrix} \tilde\theta(t) \\ \tilde p(t) \end{pmatrix} = \begin{pmatrix} -E & F \\ -F^\top & -G \end{pmatrix} \begin{pmatrix} \nabla_{\tilde\theta} H(\tilde\theta(t), \tilde p(t)) \\ \nabla_{\tilde p} H(\tilde\theta(t), \tilde p(t)) \end{pmatrix}$$

*if $H(\theta, p) = H(\theta, -p)$. In particular, if $E = G = 0$ then $\tilde A = A$, which reduces to the traditional time-reversal symmetry of canonical Hamiltonian dynamics.*
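The conservation claim of Lemma 1 is easy to check numerically. The toy script below (our illustration, not from the paper) integrates (7) with a classical fourth-order Runge-Kutta scheme for a standard Gaussian target in $d = 2$, taking $E = 0$, $F = I$ and an antisymmetric $G$ with an assumed magnitude $g$, and confirms that $H$ drifts only at the level of the integrator's error.

```python
import numpy as np

def rk4_step(z, h, f):
    """One classical fourth-order Runge-Kutta step for dz/dt = f(z)."""
    k1 = f(z)
    k2 = f(z + 0.5 * h * k1)
    k3 = f(z + 0.5 * h * k2)
    k4 = f(z + h * k3)
    return z + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

g = 0.5  # illustrative magnitude for the antisymmetric G block
A = np.block([
    [np.zeros((2, 2)), np.eye(2)],                  # E = 0,  F = I
    [-np.eye(2), np.array([[0.0, g], [-g, 0.0]])],  # -F^T,   G antisymmetric
])

H = lambda z: 0.5 * z @ z   # H(theta, p) = (theta^T theta + p^T p) / 2
rhs = lambda z: A @ z       # A grad_{theta,p} H(z), since grad H(z) = z here

z0 = np.array([1.0, -0.5, 0.3, 0.8])  # initial (theta, p)
z = z0.copy()
for _ in range(10_000):
    z = rk4_step(z, 1e-3, rhs)
print(H(z) - H(z0))  # negligible: H is conserved up to integrator error
```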
Lemma 1 suggests a generalization of HMC that can utilize an arbitrary invertible antisymmetric $A$ matrix in its dynamics; however, Lemma 2 indicates the non-canonical dynamics lack a traditional time-reversibility symmetry, which poses a potential difficulty for satisfying detailed balance. In particular, we cannot compose $\Phi_p$ with an exact/approximate simulation of $\Phi^A_{\tau,H}$ to make $\tilde\Phi^A_{\tau,H} = \Phi_p \circ \Phi^A_{\tau,H}$ self-inverse.

Our solution to obtaining a time-reversible proposal is simply to flip the elements of the $E$ and $G$ matrices, just as ordinary HMC flips the auxiliary variable $p$: (i) at the end of the Hamiltonian flow in the proposal, and (ii) once again after the MH acceptance step to return $p$ to its original direction. In this vein, we view the parameters $E$ and $G$ as auxiliary variables in the state space, and simultaneously flip $p$, $E$, and $G$ after having simulated the dynamics, rendering the proposal time-reversible according to Lemma 2; see Section 2 in the Appendix for full details of this construction. This ensures that detailed balance is satisfied for this entire proposal. To avoid random-walk behaviour in the resulting Markov chain, we can apply a sign flip to $E$ and $G$, in addition to $p$, to return them to their original directions after the MH acceptance step. The validity of this construction relies on equipping $E$ and $G$ with symmetric auxiliary distributions.

For the remainder of this paper, we further restrict to binary symmetric auxiliary distributions supported on a given antisymmetric matrix $V_0$ and its sign flip $-V_0$; see Appendix 1.1 for full details. This restriction is not necessary, but gives rise to a simple and interpretable class of algorithms, which is in spirit closest to using fixed parameters $E$ and $G$, whilst ensuring the proposal satisfies detailed balance. This construction is also reminiscent of lifting constructions prevalent in the discrete Markov chain literature (Chen et al., 1999); heuristically, the signed variables $E$ and $G$ favour proposals in opposing directions.

### 3.1. Symplectic Numerical Integration for Non-Canonical Dynamics

As with standard HMC, exactly simulating the flow $\Phi^A_{\tau,H}$ is rarely tractable, and a numerical integrator is required to approximate the flow. It is not absolutely necessary to use an explicit, symplectic integration scheme; indeed, implicit integrators are used in Riemannian HMC to maintain symplecticity of the proposal, which comes at greater complexity and computational cost (Girolami et al., 2009). However, explicit, symplectic integrators are simple, have good energy-conservation properties, and are volume-preserving/time-reversible (Hairer et al., 2006), so for the present discussion we restrict our attention to investigating leapfrog-like schemes.

We begin, as in Section 2.2, by considering the symmetric splitting (4), yielding the sub-Hamiltonians $H_1(\theta) = U(\theta)/2$, $H_2(p) = p^\top p/2$. The corresponding non-canonical dynamics for the sub-Hamiltonians $H_1(\theta)$ and $H_2(p)$ are:

$$\frac{d}{dt}\begin{pmatrix} \theta \\ p \end{pmatrix} = \begin{pmatrix} E\, \nabla_\theta U(\theta)/2 \\ -F^\top \nabla_\theta U(\theta)/2 \end{pmatrix} \qquad \text{and} \qquad \frac{d}{dt}\begin{pmatrix} \theta \\ p \end{pmatrix} = \begin{pmatrix} F\, p \\ G\, p \end{pmatrix}.$$

We denote the corresponding flows by $\Phi^A_{\epsilon,H_1(\theta)}$ and $\Phi^A_{\epsilon,H_2(p)}$ respectively. The flow $\Phi^A_{\epsilon,H_1(\theta)}$ is generally not explicitly tractable unless we take $E = 0$, in which case it is solved by an Euler translation as before. Crucially, the flow $\Phi^A_{\epsilon,H_2(p)}$ is a linear differential equation and hence analytically integrable. If $G$ is invertible (and $F = I$) then:

$$\Phi^A_{\epsilon,H_2(p)}(\theta, p) = \begin{pmatrix} \theta + G^{-1}(\exp(G\epsilon) - I)\, p \\ \exp(G\epsilon)\, p \end{pmatrix}. \tag{11}$$

See the Appendix for a detailed derivation, which also handles the general case where $G$ is not invertible. Thus when $E = 0$, the flows $\Phi^A_{\epsilon,H_1(\theta)}$ and $\Phi^A_{\epsilon,H_2(p)}$ are analytically tractable and will inherit the generalized symplecticity and (pseudo) time-reversibility of the exact dynamics in (7). Therefore, if we use the symmetric splitting (4) to construct a leapfrog-like step:

$$\Phi^{\text{frog},A}_{\epsilon,H} = \Phi^A_{\epsilon,H_1(\theta)} \circ \Phi^A_{\epsilon,H_2(p)} \circ \Phi^A_{\epsilon,H_1(\theta)} \tag{12}$$

we can construct a total proposal that consists of several leapfrog steps, followed by a flip of the momentum and $G$, $\Phi_p \circ \Phi_G$, which will be a volume-preserving, self-inverse map:

$$\tilde\Phi^{\text{frog},A}_{\epsilon,H} = \Phi_p \circ \Phi_G \circ \big(\Phi^A_{\epsilon,H_1(\theta)} \circ \Phi^A_{\epsilon,H_2(p)} \circ \Phi^A_{\epsilon,H_1(\theta)}\big)^L. \tag{13}$$

Henceforth we will always take $E = 0$ when we use $\Phi^{\text{frog},A}_{\epsilon,H}$ to generate leapfrog proposals, which, interestingly, corresponds to magnetic dynamics as discussed in Lemma 4. A full description of the magnetic HMC algorithm using this numerical integrator is given in Section 5.
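For concreteness, here is a sketch of the resulting integrator (illustrative code, not the authors'): the sub-flow (11) and the leapfrog-like step (12). It assumes $G$ is invertible so that `np.linalg.solve` applies; the general singular case uses the formula from the Appendix instead. Since $G$ is fixed within a trajectory, `expm(G * eps)` can be precomputed and cached.

```python
import numpy as np
from scipy.linalg import expm

def magnetic_flow_H2(theta, p, G, eps):
    """Exact flow (11) of H2(p) = p^T p / 2 under the magnetic A
    (E = 0, F = I): rotates p by exp(G*eps) and curls theta accordingly.
    Assumes G is invertible."""
    R = expm(G * eps)  # exp(G eps); cacheable for fixed G and eps
    theta_new = theta + np.linalg.solve(G, (R - np.eye(len(p))) @ p)
    return theta_new, R @ p

def magnetic_leapfrog(theta, p, grad_U, G, eps, L):
    """L leapfrog-like steps (12): with E = 0 the H1 sub-flows are the
    usual Euler half-steps on p, and the H2 sub-flow is the magnetic
    update above."""
    theta, p = theta.astype(float).copy(), p.astype(float).copy()
    for _ in range(L):
        p -= 0.5 * eps * grad_U(theta)
        theta, p = magnetic_flow_H2(theta, p, G, eps)
        p -= 0.5 * eps * grad_U(theta)
    return theta, p
```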
## 4. Special Cases

Here, we describe several tractable subcases of the general formulation of non-canonical Hamiltonian dynamics, since they have interesting physical interpretations.

### 4.1. Mass Preconditioned Dynamics

One simple variant of HMC is preconditioned HMC, where $p \sim N(0, M)$ (Neal, 2011); it can be implemented nearly identically to ordinary HMC. We note that preconditioning can be recovered within our framework using a simple form for the non-canonical $A$ matrix:

**Lemma 3.** *(i) Preconditioned HMC with momentum variable $p \sim N(0, M)$ in the $(\theta, p)$ coordinates is exactly equivalent to simulating non-canonical HMC with $p' = M^{-1/2} p \sim N(0, I)$ and the non-canonical matrix*

$$A = \begin{pmatrix} 0 & M^{1/2} \\ -(M^{1/2})^\top & 0 \end{pmatrix},$$

*and then transforming back to $(\theta, p)$ coordinates using $p = M^{1/2} p'$. Here $M^{1/2}$ is a Cholesky factor for $M$. (ii) On the other hand, if we apply a change of basis (via an invertible matrix $F$) to our coordinates, $\theta' = F^{-1}\theta$, simulate HMC in the $(\theta', p)$ coordinates, and transform back to the original basis using $F$, this is exactly equivalent to non-canonical HMC with*

$$A = \begin{pmatrix} 0 & F \\ -F^\top & 0 \end{pmatrix}.$$

Lemma 3 illustrates a fact alluded to in (Neal, 2011): using a mass preconditioning matrix $M$ and a change of basis given by the matrix $M^{-1/2}$ acting on $\theta$ leaves the HMC algorithm invariant.

### 4.2. Magnetic Dynamics

The primary focus of this paper is to investigate the subcase of the dynamics where

$$A = \begin{pmatrix} 0 & I \\ -I & G \end{pmatrix} \tag{14}$$

for two important reasons.³ Firstly, for this choice of $A$ we can construct an explicit, symplectic, leapfrog-like integration scheme, which is important for developing an efficient HMC sampler as discussed in Section 3.1. Secondly, the dynamics have an interesting physical interpretation quite distinct from mass preconditioning and other HMC variants like Riemannian HMC (Girolami et al., 2009):

**Lemma 4.** *In 3 dimensions, the non-canonical Hamiltonian dynamics corresponding to the Hamiltonian $H(\theta, p) = U(\theta) + \frac{1}{2} p^\top p$ and $A$ matrix as in (14) are equivalent to the Newtonian mechanics of a charged particle (with unit mass and charge) coupled to a magnetic field $B$ (given by a particular function of $G$; see Appendix):*

$$\frac{d^2\theta}{dt^2} = -\nabla_\theta U(\theta) + \frac{d\theta}{dt} \times B.$$

This interpretation is perhaps surprising, since Hamiltonian formulations of classical magnetism are uncommon, although the quantum mechanical treatment naturally incorporates a Hamiltonian framework. However, in light of Lemma 3, we might wonder if by a clever rewriting of the Hamiltonian we can reproduce this system of ODEs using the canonical $A$ matrix (i.e. $E = G = 0$, $F = I$). This is not the case:

**Lemma 5.** *The non-canonical Hamiltonian dynamics with magnetic $A$ and Hamiltonian $H(\theta, p) = U(\theta) + \frac{1}{2} p^\top p$ cannot be obtained using canonical Hamiltonian dynamics for any choice of smooth Hamiltonian. (See Appendix.)*

³ Note that the effect of a non-identity $F$ matrix can be achieved by simply composing these magnetic dynamics with a coordinate transformation as suggested in Lemma 3.

## 5. The Magnetic HMC (MHMC) Algorithm

Using the results discussed in Section 3 and Section 3.1, we can now propose magnetic HMC; see Algorithm 1.

**Algorithm 1** Magnetic HMC (MHMC)

Input: $H$, $G$, $L$, $\epsilon$
Initialize $(\theta_0, p_0)$, and set $G_0 \leftarrow G$
for $n = 1, \ldots, N$ do
    Resample $p_{n-1} \sim N(0, I)$
    Set $(\tilde\theta_n, \tilde p_n) \leftarrow \mathrm{LF}(H, L, \epsilon, (\theta_{n-1}, p_{n-1}, G_{n-1}))$ with $\Phi^A_{\epsilon,H_2(p)}$ as in (11)
    Flip momentum $(\tilde\theta_n, \tilde p_n) \leftarrow (\tilde\theta_n, -\tilde p_n)$ and set $\tilde G_n \leftarrow -G_{n-1}$
    if $\mathrm{Unif}([0, 1]) < \min(1, \exp(H(\theta_{n-1}, p_{n-1}) - H(\tilde\theta_n, \tilde p_n)))$ then
        Set $(\theta_n, p_n, G_n) \leftarrow (\tilde\theta_n, \tilde p_n, \tilde G_n)$
    else
        Set $(\theta_n, p_n, G_n) \leftarrow (\theta_{n-1}, p_{n-1}, G_{n-1})$
    end if
    Flip momentum $p_n \leftarrow -p_n$ and flip $G_n \leftarrow -G_n$
end for
Output: $(\theta_n)_{n=0}^N$

One further remark is that, by construction, the integrator for magnetic HMC is expected to have similarly good energy conservation properties to the integrator of standard HMC:

**Lemma 6.** *The symplectic leapfrog-like integrator for magnetic HMC will have the same local ($O(\epsilon^3)$) and global ($O(\epsilon^2)$) error scaling (over $\tau/\epsilon = L$ steps) as the canonical leapfrog integrator of standard HMC if the Hamiltonian is separable. (See Appendix.)*
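The following Python sketch mirrors Algorithm 1, reusing `magnetic_leapfrog` from the Section 3.1 sketch (names and structure are our illustration, not the authors' code). Note how the sign bookkeeping works: an accepted proposal sets $G_n = -G_{n-1}$ and the final flip restores $G_{n-1}$, while a rejection followed by the final flip reverses $G$, so the next trajectory curls the opposite way.

```python
import numpy as np

def mhmc(U, grad_U, G, eps, L, n_iters, theta0, rng):
    """A sketch of Algorithm 1 (Magnetic HMC)."""
    H = lambda th, p: U(th) + 0.5 * p @ p
    theta, G = theta0.astype(float).copy(), G.copy()
    samples = [theta.copy()]
    for _ in range(n_iters):
        p = rng.standard_normal(theta.shape)             # resample momentum
        th_new, p_new = magnetic_leapfrog(theta, p, grad_U, G, eps, L)
        p_new, G_new = -p_new, -G                        # flip p and G
        if np.log(rng.uniform()) < H(theta, p) - H(th_new, p_new):
            theta, G = th_new, G_new                     # accept
        G = -G                     # final flip of G (and of p, which is
        samples.append(theta.copy())  # discarded before the next resample)
    return np.array(samples)
```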
It is worthwhile to contrast the algorithmic differences between magnetic HMC and ordinary HMC. Intuitively, the role of the flow $\Phi^A_{\epsilon,H_2(p)}$, which reduces to the standard Euler translation update of ordinary HMC when $G = 0$, is to introduce a rotation into the momentum space of the flow. In particular, a non-zero element $G_{ij}$ will allow momentum to periodically flow between $p_i$ and $p_j$. If we regard $G$ as an element in the Lie algebra of antisymmetric matrices, which can be thought of as infinitesimal rotations, then the exponential map $\exp(G\epsilon)$ will project this transformation into the Lie group of real orthogonal linear maps.

With respect to computational cost, although magnetic HMC requires matrix exponentiation/diagonalization to simulate $\Phi^A_{\epsilon,H_2(p)}$, this only needs to be computed once upfront for $G$ and cached; moreover, as $G$ is diagonalizable, the exact exponential can be calculated in $O(d^3)$ time. Additionally, there is an $O(d^2)$ cost for the matrix-vector products needed to implement the flow $\Phi^A_{\epsilon,H_2(p)}$, as with preconditioning. However, it is possible to design sparsified matrix representations of $A$, which translate into sparsified rotations if we only wish to curl in a specific subspace of dimension $d_0$, reducing these costs to $O(d_0^3)$ and $O(d_0^2)$ respectively. An important problem to address is the selection of the $G$ matrix, which affords a great deal of flexibility to MHMC relative to HMC; this point is further discussed in the Experiments section, where we argue that in certain cases intuitive heuristics can be used to select the $G$ matrix.

## 6. Experiments

Here we investigate the performance of magnetic HMC against standard HMC in several examples, in each case commenting on our choice of the magnetic field term $G$. Step sizes ($\epsilon$) and numbers of leapfrog steps ($L$) were tuned to achieve an acceptance rate between 0.7 and 0.8, after which the norm of the non-zero elements in $G$ was set to 0.1 to 0.2, which was found to work well. In the Appendix we also display illustrations of different MHMC proposals across several targets in order to provide more intuition for MHMC's dynamics. Further experimental details and an additional experiment on a Gaussian funnel target are also provided in the Appendix.

### 6.1. Multiscale Gaussians

We consider two highly ill-conditioned Gaussians, similar to those in (Sohl-Dickstein et al., 2014), to illustrate a heuristic for $G$ matrix selection and demonstrate properties of the magnetic dynamics. In particular, we consider a centered, uncorrelated 2D Gaussian with covariance eigenvalues of $10^6$ and $1$, as well as a centered, uncorrelated 10D Gaussian with two large covariance eigenvalues of $10^6$ and remaining eigenvalues of $1$. We denote their coordinates as $x = (x_1, x_2) \in \mathbb{R}^2$ and $x = (x_1, \ldots, x_{10}) \in \mathbb{R}^{10}$ respectively.

*Figure 2. Averaged autocorrelation of HMC vs. MHMC on a 10D ill-conditioned Gaussian (left) and on a 2D ill-conditioned Gaussian (right).*

HMC will have difficulty exploring the directions of large marginal variance, since its exploration will often be limited by the smaller variance directions. Accordingly, in order to induce a periodic momentum flow between the directions of small and large variance, we introduce non-zero components $G_{ij}$ into the subspaces spanned directly between the large and small eigenvalue directions, as sketched below.
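A sketch of this heuristic (our illustration; the index pairs and magnitude `g` are assumptions, not the paper's exact settings): build an antisymmetric $G$ whose only non-zero entries couple chosen coordinate pairs. Note that such a $G$ is typically singular, so the flow (11) must be computed via the general formula from the Appendix (or block-wise on the coupled subspace) rather than the $G^{-1}$ form.

```python
import numpy as np

def magnetic_G(d, pairs, g=0.1):
    """Antisymmetric G with G[i, j] = g and G[j, i] = -g for each coupled
    pair, inducing a periodic momentum flow between coordinates i and j."""
    G = np.zeros((d, d))
    for i, j in pairs:
        G[i, j], G[j, i] = g, -g
    return G

# e.g. for the 10D multiscale Gaussian: couple each large-variance
# coordinate (here 0 and 1) to a small-variance one (hypothetical choice)
G = magnetic_G(10, pairs=[(0, 2), (1, 3)], g=0.1)
```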
Indeed, we find that the magnetic $G$ term encourages more efficient exploration, as we can see from the averaged autocorrelation of samples generated from the HMC/MHMC chains; see Figure 2. Further, by running 50 parallel chains for $10^7$ timesteps, we computed both the bias and Monte Carlo standard errors (MCSE) of the estimators of the target moments, as shown in Table 1 and Table 2.

*Table 1. Absolute bias ± MCSE for 2D ill-conditioned Gaussian moments for HMC vs. MHMC. Note that $E[x_1^2] = 10^6$ and $E[x_2^2] = 1$.*

| algorithm | $x_1^2$ (bias ± MCSE) | $x_2^2$ (bias ± MCSE) |
| --- | --- | --- |
| HMC | 0.947 ± 41.7 | 0.00108 ± 0.0247 |
| MHMC | 0.657 ± 22.5 | 0.000280 ± 0.00438 |

*Table 2. Absolute bias ± MCSE for 10D ill-conditioned Gaussian moments for HMC vs. MHMC. Note that $E[x_1^2] = 10^6$ and $E[x_{10}^2] = 1$.*

| algorithm | $x_1^2$ (bias ± MCSE) | $x_{10}^2$ (bias ± MCSE) |
| --- | --- | --- |
| HMC | 24.7 ± 46.6 | 0.00376 ± 0.0249 |
| MHMC | 9.68 ± 32.7 | 0.00127 ± 0.0132 |

### 6.2. Mixture of Gaussians

We compare MHMC vs. HMC on a simple, but interesting, 2D density over $\mathbf{x} = (x, y) \in \mathbb{R}^2$ comprised of an evenly weighted mixture of isotropic Gaussians:

$$p(\mathbf{x}) = \tfrac{1}{2} N(\mathbf{x}; \mu, \Sigma) + \tfrac{1}{2} N(\mathbf{x}; -\mu, \Sigma)$$

for $\sigma_x^2 = \sigma_y^2 = 1$, $\rho_{xy} = 0$ and $\mu = (2.5, 2.5)$. This problem is challenging for HMC because the gradients in canonical Hamiltonian dynamics force it to one of the two modes. We tuned HMC to achieve an acceptance rate of 0.75 and used the same $\epsilon$, $L$ for MHMC, generating 15000 samples from both HMC and MHMC with these settings. The addition of the magnetic field term $G$, which has one degree of freedom in this case, introduces an asymmetric curl into the dynamics that pushes the sampler across the saddle point to the other mode, allowing it to efficiently mix around both modes and between them; see Figure 3.

*Figure 3. Left: 500 samples from HMC; Middle: 500 samples from MHMC; Right: MMD between HMC/MHMC samples for various magnitudes of the non-zero component of the magnetic field, denoted $g$. Note $g = 0$ corresponds to standard HMC.*

The maximum mean discrepancy (MMD) between exact samples generated from the target density and samples generated from both the HMC and MHMC chains was also estimated for various magnitudes of $G$, using a quadratic kernel $k(x, x') = (1 + \langle x, x' \rangle)^2$ and averaged over 100 runs of the Markov chains (Borgwardt et al., 2006); a sketch of this estimator is given at the end of this subsection. Here, we clearly see that for various values of the non-zero component of $G$, denoted $g$, the samples generated by MHMC more faithfully reflect the structure of the posterior. As before, we ran 50 parallel chains for $10^7$ timesteps to compute both the bias and Monte Carlo standard errors (MCSE) of the estimators of the target moments, as shown in Table 3.

*Table 3. Bias ± MCSE for the 2D mixture of Gaussians for HMC vs. MHMC with $g = 0.1$. Note that $E[x] = 0$ and $E[x^2] = 7.25$.*

| algorithm | $x$ (bias ± MCSE) | $x^2$ (bias ± MCSE) |
| --- | --- | --- |
| HMC | 0.0132 ± 0.0644 | 0.00264 ± 0.0114 |
| MHMC | 0.00239 ± 0.012 | 0.000596 ± 0.00365 |

Additional experiments over a range of $\epsilon$, $L$ (and corresponding acceptance rates), along with further details, are included in the Appendix for this example, demonstrating similar behavior.
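For reference, here is a compact sketch of the discrepancy estimate used above (our code, a standard biased V-statistic with the quadratic kernel from the text):

```python
import numpy as np

def mmd2_quadratic(X, Y):
    """Biased squared-MMD estimate between sample arrays X (n, d) and
    Y (m, d) with the quadratic kernel k(x, x') = (1 + <x, x'>)^2."""
    k = lambda A, B: (1.0 + A @ B.T) ** 2
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()
```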
### 6.3. FitzHugh-Nagumo model

Finally, we consider the problem of Bayesian inference over the parameters of the FitzHugh-Nagumo model (a set of nonlinear ordinary differential equations, originally developed to model the behavior of axial spiking potentials in neurons) as in (Ramsay et al., 2007; Girolami & Calderhead, 2011). The FitzHugh-Nagumo model is a dynamical system $(V(t), R(t))$ defined by the following coupled differential equations:

$$\dot V(t) = c\big(V(t) - V(t)^3/3 + R(t)\big), \qquad \dot R(t) = -\big(V(t) - a + b\, R(t)\big)/c. \tag{15}$$

We consider the problem where the initial conditions $(V(0), R(0))$ of the system (15) are known, and a set of noise-corrupted observations $(\tilde V(t_n), \tilde R(t_n))_{n=0}^N = (V_{a,b,c}(t_n) + \varepsilon^V_n,\ R_{a,b,c}(t_n) + \varepsilon^R_n)_{n=0}^N$ at discrete time points $0 = t_0 < t_1 < \cdots < t_N$ is available; note that we indicate the dependence of the trajectories on the model parameters explicitly via subscripts. It is not possible to recover the true parameter values of the model from these observations, but we can obtain a posterior distribution over them by specifying a model for the observation noise and a prior distribution over the model parameters. Similar to (Ramsay et al., 2007; Girolami & Calderhead, 2011), we assume that the observation noise variables $(\varepsilon^V_n)_{n=0}^N$ and $(\varepsilon^R_n)_{n=0}^N$ are iid $N(0, 0.1^2)$, and take an independent $N(0, 1)$ prior over each parameter $a$, $b$, and $c$. This yields a posterior distribution of the form

$$p(a, b, c \mid \tilde V, \tilde R) \propto N(a; 0, 1)\, N(b; 0, 1)\, N(c; 0, 1) \prod_{n=0}^{N} N(\tilde V(t_n); V_{a,b,c}(t_n), 0.1^2)\, N(\tilde R(t_n); R_{a,b,c}(t_n), 0.1^2). \tag{16}$$

Importantly, the highly non-linear dependence of the trajectory on the parameters $a$, $b$ and $c$ yields a complex posterior distribution; see Figure 4. Full details of the model set-up can be found in (Ramsay et al., 2007; Girolami & Calderhead, 2011).

*Figure 4. Marginal posterior density contour plot over $a$, $b$, with $c = 3$.*

For our experiments, we used fixed parameter settings of $a = 0.2$, $b = 0.2$, $c = 3.0$ to generate 200 evenly-spaced noise-corrupted observations over the time interval $t = [0, 20]$ (as in (Ramsay et al., 2007; Girolami & Calderhead, 2011)). We performed inference over the posterior distribution of parameters $(a, b, c)$ with this set of observations using both the HMC and MHMC algorithms, with MHMC perturbed by a magnetic field in each of the 3 axial planes of parameters, along the $ab$, $ac$, and $bc$ axes, with magnitude $g = 0.1$. The chains were run to generate 1000 samples over 100 repetitions with settings of $\epsilon = 0.015$, $L = 10$, which resulted in an average acceptance rate of 0.8. The effective sample size (ESS) of each of the chains, normalized per unit time, was then computed for each chain. Since each query to the posterior log-likelihood or posterior gradient log-likelihood requires solving an augmented set of differential equations as in (15), the computation time ($\approx 238$ s) of all the methods was nearly identical.

*Table 4. HMC vs. MHMC performance targeting the FitzHugh-Nagumo posterior parameters.*

| algorithm | ESS $(a, b, c)$ | ESS$(a, b, c)$/time (s) |
| --- | --- | --- |
| HMC | 1000, 318, 606 | 4.20, 1.33, 2.55 |
| MHMC $ab$ | 1000, 349, 658 | 4.20, 1.47, 2.77 |
| MHMC $ac$ | 1000, 336, 628 | 4.20, 1.42, 2.64 |
| MHMC $bc$ | 1000, 326, 649 | 4.20, 1.37, 2.73 |

Moreover, note that all methods achieved nearly perfect mixing over the first coordinate, so their effective sample sizes were truncated at 1000 for the $a$ coordinate. In this example, we can see that all magnetic perturbations slightly increase the mixing rate of the sampler over each of the $(b, c)$ coordinates, with the $ab$ perturbation performing best.
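To make the target concrete, here is a sketch of the unnormalized log-posterior (16) (our illustration: the initial condition $(V(0), R(0)) = (-1, 1)$ and the solver tolerances are assumptions, not taken from the paper). Running HMC or MHMC on this target additionally requires gradients of the trajectories with respect to $(a, b, c)$, obtained by augmenting (15) with forward sensitivity equations, which this sketch omits.

```python
import numpy as np
from scipy.integrate import solve_ivp

def fhn_log_posterior(params, t_obs, V_obs, R_obs, sigma=0.1):
    """Unnormalized log-posterior (16): N(0, 1) priors on (a, b, c) and
    iid N(0, sigma^2) observation noise on both V and R, per the model."""
    a, b, c = params

    def rhs(t, y):
        V, R = y
        return [c * (V - V**3 / 3.0 + R),   # dV/dt, per (15)
                -(V - a + b * R) / c]       # dR/dt, per (15)

    # assumed-known initial conditions (illustrative values)
    sol = solve_ivp(rhs, (t_obs[0], t_obs[-1]), [-1.0, 1.0],
                    t_eval=t_obs, rtol=1e-8, atol=1e-8)
    V, R = sol.y
    log_prior = -0.5 * (a**2 + b**2 + c**2)
    log_lik = -0.5 * (np.sum((V_obs - V)**2)
                      + np.sum((R_obs - R)**2)) / sigma**2
    return log_prior + log_lik
```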
We have also shown several examples where the non-canonical dynamics of MHMC can improve upon the sampling performance of standard HMC. Important directions for further research include finding more automated, adaptive mechanisms to set the matrix $G$, as well as investigating positionally-dependent magnetic field components, similar to how Riemannian HMC corresponds to local preconditioning. We believe that exploiting more general deterministic flows (such as also maintaining a non-zero $E$ in the top-left block of a general $A$ matrix) could form a fruitful area for further research on MCMC methods.

## Acknowledgements

The authors thank John Aston, Adrian Weller, Maria Lomeli, Yarin Gal and the anonymous reviewers for helpful comments. MR acknowledges support by the UK Engineering and Physical Sciences Research Council (EPSRC) grant EP/L016516/1 for the University of Cambridge Centre for Doctoral Training, the Cambridge Centre for Analysis. RET thanks EPSRC grants EP/M0269571 and EP/L000776/1 as well as Google for funding.

## References

Arnold, Vladimir. Mathematical Methods of Classical Mechanics. 1989. ISBN 0387968903.

Betancourt, Michael. A conceptual introduction to Hamiltonian Monte Carlo. arXiv:1701.02434, 2017.

Borgwardt, Karsten M., Gretton, Arthur, Rasch, Malte J., Kriegel, Hans-Peter, Schölkopf, Bernhard, and Smola, Alex J. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, volume 22, 2006.

Carpenter, Bob, Gelman, Andrew, Hoffman, Matt, Lee, Daniel, Goodrich, Ben, Betancourt, Michael, Brubaker, Marcus A., Li, Peter, and Riddell, Allen. Stan: A probabilistic programming language. Journal of Statistical Software, 2016.

Chen, Fang, Lovász, László, and Pak, Igor. Lifting Markov chains to speed up mixing. In Proceedings of the Thirty-first Annual ACM Symposium on Theory of Computing, STOC '99, pp. 275-281, New York, NY, USA, 1999. ACM. ISBN 1-58113-067-8. doi: 10.1145/301250.301315.

Duane, Simon, Kennedy, A. D., Pendleton, Brian J., and Roweth, Duncan. Hybrid Monte Carlo. Physics Letters B, 195(2):216-222, 1987.

Girolami, Mark and Calderhead, Ben. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123-214, March 2011.

Girolami, Mark, Calderhead, Ben, and Chin, Siu A. Riemannian manifold Hamiltonian Monte Carlo. 2009.

Green, Peter J. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711-732, 1995.

Hairer, Ernst, Hochbruck, Marlis, Iserles, Arieh, and Lubich, Christian. Geometric Numerical Integration. Oberwolfach Reports, pp. 805-882, 2006.

Hoffman, Matt and Gelman, Andrew. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15:1-31, 2014.

Ma, Yi-An, Chen, Tianqi, and Fox, Emily B. A complete recipe for stochastic gradient MCMC. In NIPS, pp. 1-19, 2015.

Neal, Radford. Bayesian Learning for Neural Networks, volume 118. 1996.

Neal, Radford M. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, pp. 113-162, 2011.

Ramsay, J. O., Hooker, G., Campbell, D., and Cao, J. Parameter estimation for differential equations: A generalized smoothing approach. Journal of the Royal Statistical Society, Series B, 2007.

Rey-Bellet, Luc and Spiliopoulos, Konstantinos. Irreversible Langevin samplers and variance reduction: A large deviations approach. Nonlinearity, 28(7):2081, 2015.
Robert, Christian P. and Casella, George. Monte Carlo Statistical Methods, volume 95. 2004.

Sohl-Dickstein, J., Mudigonda, Mayur, and DeWeese, M. Hamiltonian Monte Carlo without detailed balance. In Proceedings of the 31st International Conference on Machine Learning, 32:9, 2014.