# Fixed-Distance Hamiltonian Monte Carlo

Hadi Mohasel Afshar (CSIRO's Data61, Eveleigh, NSW 2015, Australia; Hadi.Afshar@data61.csiro.au)
Sally Cripps (CSIRO's Data61 & The University of Sydney, Eveleigh, NSW 2015, Australia; Sally.Cripps@data61.csiro.au)

*36th Conference on Neural Information Processing Systems (NeurIPS 2022).*

**Abstract.** We propose a variation of Hamiltonian Monte Carlo (HMC) sampling in which the equations of motion are simulated for a fixed traversed distance rather than the conventional fixed simulation time. This new mechanism tends to generate proposals that have higher target probability values. The momentum distribution that is naturally paired with our Fixed-Distance HMC (FDHMC), and that keeps the proposal acceptance probability close to 1, is not Gaussian and generates momenta with a higher expected magnitude. This translates into reduced correlation between successive MCMC states and, according to our experimental results, leads to an improvement in effective sample size per gradient when compared to the baseline HMC and No-U-Turn (NUTS) samplers.

## 1 Introduction

Markov chain Monte Carlo (MCMC) is an inference mechanism that approximates a target probability distribution by a sequence of states (a.k.a. samples or draws) (Chib and Greenberg, 1995). Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal, 2011) is an MCMC algorithm in which the current state is treated as a particle (usually of unit mass) to which a randomly drawn momentum vector is assigned. The state is also associated with a potential energy that is proportional to its negative log target probability density. The next state is proposed by evolving the current state according to a simulation of the equations of motion (using leapfrog integration) for a fixed time duration. If the simulation is accurate, the probability of accepting the proposal as the next state in the Markov chain is close to one. This property makes HMC a powerful sampling tool: the proposal can be quite distant from the current state while still being accepted with high probability, leading to comparatively rapid exploration of the target distribution. In this work we further improve on HMC by highlighting two of its limitations and resolving them.

**Limitation 1.** *Along the trajectory traversed during any fixed-time simulation of the equations of motion, it is more likely that a state with low target probability is proposed.* In HMC, the state is guided by the gradient information of the target density to move towards regions of high probability (i.e. low potential energy). However, it moves through these regions with high velocity and thus oscillates around them, spending more time in the surrounding low-probability regions (i.e. hills of the potential energy) where it moves with low velocity. As such, the state is more likely to be in a low-probability region when the simulation time is over. This bias is counter-intuitive. What makes the overall sampling process unbiased (in the sense that the MCMC chain converges to the target distribution) is HMC's momentum distribution. Note that a state with a low-magnitude momentum cannot move to a low-probability region (at some level its kinetic energy reaches zero, the direction of the momentum vector is reversed, and the state is pulled back), but a state with a low-magnitude momentum can always move from a low-probability region to a high-probability region, as its kinetic energy will only increase.
Therefore, if the momentum distribution is concentrated around 0, there is a strong bias towards high-probability regions. Conversely, if the momentum distribution favours large magnitudes, the bias is less pronounced. The exact distribution that generates the right amount of low-magnitude momenta to counterbalance the bias induced by Limitation 1 happens to be a zero-mean Gaussian. However, using this distribution to generate low-magnitude momenta causes another limitation for HMC:

**Limitation 2.** *HMC suffers from correlated samples in regions of high probability.* A state with a low-magnitude momentum cannot leave a high-probability region (it does not have sufficient kinetic energy to lift it out of the potential energy valley). During the simulation time it only oscillates (or shivers) in place, leading to a proposal that is not distant from the current state. Therefore, despite HMC's nominally high proposal acceptance probability, it can also produce highly correlated, close-by samples.

Note that algorithms such as the No-U-Turn sampler (NUTS) (Hoffman and Gelman, 2014), where the number of leapfrog steps, and hence the simulation duration, is decided dynamically, do not resolve Limitation 1: regardless of the number of leapfrogs, or whether the state has made a U-turn, high-probability regions are traversed faster than low-probability regions (and fewer intermediate states are generated from them), so they are under-represented. Regarding Limitation 2, a dynamically decided simulation duration may save some computation by avoiding redundant oscillations for a state that is trapped in an energy valley, but it does not guarantee that the generated proposal is distant from the current state, so Limitation 2 is not addressed either.

The contribution of the present work is to systematically tackle both aforementioned HMC limitations by simulating the equations of motion for a fixed traversed distance rather than a fixed evolution time. An evolving state may spend a longer time in low-probability regions, but due to its low momentum/velocity, the distance traversed in each leapfrog step is short. Conversely, even though less time may be spent in high-probability regions, the higher momentum makes the distance traversed per leapfrog step longer. The fixed distance budget is therefore exhausted more rapidly in high-probability regions, and the bias against these regions disappears. This modification to the HMC algorithm also resolves Limitation 2. Intuitively speaking, no counterbalancing is required because there is no bias towards regions of low probability in our Fixed-Distance HMC (FDHMC). As such, in order to maintain a proposal acceptance probability close to 1, FDHMC's momentum distribution needs to generate values with a higher expected magnitude than those generated from a Gaussian centred at 0. This translates into faster exploration of the probability space and substantially less correlated samples.

To design FDHMC, we rely on the theory of Reversible Jump MCMC (RJMCMC) (Green, 1995). There are recent works that view HMC in the light of RJMCMC (Levy et al., 2018; Afshar et al., 2021) to design non-volume-preserving extensions.
But we are unaware of any existing variation of HMC where the traversed distance is fixed, where the momentum vectors are drawn from a distribution other than a normal, or where any solution is proposed for the two aforementioned limitations. After providing brief background in Section 2, we design FDHMC in Sections 3 and 4. According to our experiments in Section 5, FDHMC can outperform the baseline HMC and NUTS by a substantial margin in terms of effective sample size per gradient.

## 2 Background

Our ultimate task is to approximate a target density, $\pi_Q(q)$, defined on $\mathbb{R}^n$, with a set of states $\{q^{(c)}\}_{c=1}^{N}$ (also known as samples or draws). In the Markov chain Monte Carlo (MCMC) framework, each sample, $q^{(c)}$, is drawn conditioned on the previous sample, $q^{(c-1)}$. Nonetheless, the approximation is unbiased: as the number of draws $N \to \infty$, the number of draws that fall within any region $A \subseteq \mathbb{R}^n$ is proportional to the probability mass associated with that region, $\int_A \pi_Q(q)\,dq$.

**Reversible Jump MCMC** (Green, 1995) is an MCMC technique where, to draw the next state, $q^{(c+1)}$:

(a) The current state, $q^{(c)}$, is augmented by an auxiliary vector, $r^{(c)}$, drawn from an arbitrary auxiliary distribution (which, throughout, we assume is continuous with a density function, $\pi_R(r)$, defined on $\mathbb{R}^m$, for some dimensionality $m$).

(b) $(q^{(*)}, r^{(*)}) := F(q^{(c)}, r^{(c)})$ is proposed as the next augmented state, where $F: \mathbb{R}^{n+m} \to \mathbb{R}^{n+m}$ is an arbitrary differentiable mapping that is reversible, in the sense that $F$ is its own inverse:

$$F(q, r) = (q', r') \iff F(q', r') = (q, r). \qquad \text{(Reversibility condition)} \tag{1}$$

(c) With the Metropolis-Hastings-Green (MHG) acceptance probability

$$\alpha\big((q^{(c)}, r^{(c)}) \to (q^{(*)}, r^{(*)})\big) := \min\left\{1,\; \frac{\pi_Q(q^{(*)})\,\pi_R(r^{(*)})}{\pi_Q(q^{(c)})\,\pi_R(r^{(c)})} \left|\frac{\partial(q^{(*)}, r^{(*)})}{\partial(q^{(c)}, r^{(c)})}\right|\right\}, \tag{2}$$

the proposal, $(q^{(*)}, r^{(*)})$, is accepted as the next augmented MCMC state, $(q^{(c+1)}, r^{(c+1)})$; otherwise the current state, $(q^{(c)}, r^{(c)})$, is returned.

(d) $q^{(c+1)}$ is kept (as the next draw from the target density) and the auxiliary $r^{(c+1)}$ is discarded.

**Hamiltonian Monte Carlo** (HMC) (Duane et al., 1987) is a special case of RJMCMC where (a) the original state, $q$ (referred to in this context as the position vector), and the auxiliary vector (referred to as the momentum vector and usually denoted by $p$) have the same dimension (i.e. $n = m$); (b) the auxiliary density function, $\pi_P(p)$, is an $n$-variate normal distribution, usually assumed to be standard, $\mathcal{N}(\mathbf{0}_n, I_{n\times n})$; and (c) the reversible mapping, $F$, is the simulation of the equations of motion for total time $\epsilon L$, via momentum-Verlet integration with $L$ leapfrog steps, each of duration $\epsilon$, followed by negating the momentum vector:

$$F(q, p) := g \circ f_{L,\epsilon} \circ f_{L-1,\epsilon} \circ \cdots \circ f_{1,\epsilon}(q, p), \qquad \text{where } g(q, p) = (q, -p),$$

and each momentum-Verlet leapfrog step $f_{i,\epsilon}(q^{[i]}, p^{[i]}) = (q^{[i+1]}, p^{[i+1]})$ is defined by the following equations:

$$p^{[i+\frac{1}{2}]} = p^{[i]} - \tfrac{\epsilon}{2}\nabla U(q^{[i]}), \qquad q^{[i+1]} = q^{[i]} + \epsilon\, p^{[i+\frac{1}{2}]}, \qquad p^{[i+1]} = p^{[i+\frac{1}{2}]} - \tfrac{\epsilon}{2}\nabla U(q^{[i+1]}), \tag{3}$$

where the potential energy function $U(q) := -\log(\pi_Q(q))$, and $\nabla U(q)$ is its gradient vector. It can easily be verified that this mapping, $F(q, p)$, is reversible. Also, since $F(q, p)$ is composed only of shear mappings and a negation, it preserves volume, i.e. the absolute Jacobian determinant of the total transformation is equal to 1. As such, the MHG acceptance probability (2) simplifies to

$$\min\left\{1,\; \frac{\pi_Q(q^{(*)})\,\pi_P(p^{(*)})}{\pi_Q(q^{(c)})\,\pi_P(p^{(c)})}\right\} = \min\left\{1,\; \frac{\exp\{-(U(q^{(*)}) + \frac{1}{2}\|p^{(*)}\|^2)\}}{\exp\{-(U(q^{(c)}) + \frac{1}{2}\|p^{(c)}\|^2)\}}\right\}.$$

Evolution of a phase state, $(q, p)$, via the equations of motion conserves the Hamiltonian, $U(q) + \frac{1}{2}\|p\|^2$. As such, if the leapfrog approximation is reasonably precise, HMC's proposal acceptance probability is close to 1. Seeing HMC as an instance of RJMCMC provides us with the insight required to design our own fixed-distance HMC in the next sections.
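To make this background concrete, here is a minimal Python sketch (our own illustration, not code from the paper) of the momentum-Verlet map $F$ of (3); `grad_U` is an assumed user-supplied gradient of the potential energy:

```python
import numpy as np

def momentum_verlet(q, p, grad_U, eps, L):
    """Simulate L leapfrog steps of size eps, then negate the momentum.

    Implements F = g . f_L . ... . f_1 from Section 2, where grad_U(q)
    returns the gradient of U(q) = -log pi_Q(q).
    """
    p = p - 0.5 * eps * grad_U(q)        # first momentum half-step
    for i in range(L):
        q = q + eps * p                  # position full-step
        if i < L - 1:                    # intermediate momentum full-steps
            p = p - eps * grad_U(q)
    p = p - 0.5 * eps * grad_U(q)        # final momentum half-step
    return q, -p                         # negation g(q, p) = (q, -p)
```

Because the leapfrog steps are time-symmetric and the negation is an involution, applying `momentum_verlet` twice recovers the initial phase state, which is exactly the reversibility condition (1).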
## 3 Reversible leapfrogs that traverse a fixed distance

Designing a reversible fixed-distance leapfrog mechanism is challenging. The problem is illustrated in Figure 1, where each phase state, $(q, p)$, is depicted by a node, $q$, to which an arrow, $p$, is attached. An initial phase state $(q, p) := (q^{[0]}, p^{[0]})$ that goes through $k$ leapfrog steps (each as in Figure 1-a) evolves into $(q^{[k]}, p^{[k]})$ and traverses the total distance

$$d^{[k]}(q, p) := \sum_{i=1}^{k} \|q^{[i]} - q^{[i-1]}\| = \sum_{i=1}^{k} \epsilon\,\|p^{[i-\frac{1}{2}]}\|.$$

We want to restrict the total traversed distance to a fixed value, $D$. In Figure 1-b, this distance budget is exhausted between the 4th and 5th leapfrog steps; that is, $d^{[4]}(q, p) < D < d^{[5]}(q, p)$. Suppose we return $(q^{[4]}, p^{[4]})$, with the momentum negated, as the final state: $F(q, p) = (q' := q^{[4]}, p' := -p^{[4]})$. This process may seem reversible, as 4 leapfrog steps map $(q', p')$ back to $(q, p)$. Nonetheless, in the backward direction the termination condition may not occur at $(q, p)$. For instance, in Figure 1-c, $d^{[6]}(q', p') < D < d^{[7]}(q', p')$, which implies that $F(q', p')$ is based on $(q'^{[6]}, p'^{[6]})$ rather than $(q, p)$. This violates the required RJMCMC reversibility condition (1).

An alternative potential solution is to append an extra leapfrog step of duration $\tau' := (D - d^{[4]}(q, p))/\|p^{[4]}\|$ to the forward pass, which moves the state to exactly the point where the traversed distance equals $D$; that is, $F(q, p) = (q^{[x]}, p^{[4]})$. However, this process is not reversible either, because the last forward step has a duration $\tau' < \epsilon$ and does not match the first leapfrog step (of duration $\epsilon$) in the backward direction. Our key insight is that by adding another leapfrog step of some duration, $\tau < \epsilon$, to the *beginning* of the process, we can make the transition symmetric and reversible. This algorithm is illustrated in Figure 2 and described in Section 3.1.

*Figure 1: Fixing the traversed distance of the ordinary leapfrog mechanism: (a) a single leapfrog step mapping $(q, p)$ to $(q^{[1]}, p^{[1]})$; (b) a leapfrog trajectory starting at $(q, p)$ exhausts the traversed-distance budget between the 4th and 5th steps, at $q^{[x]}$; (c) the leapfrog trajectory starting from $(q' := q^{[4]}, p' := -p^{[4]})$ exhausts the distance budget at $q'^{[x]}$ and is not reversible.*

### 3.1 Fixed-Distance Leapfrog Mechanism (FD-Leapfrogs)

Given the leapfrog step size, $\epsilon$, and the total evolution distance, $D$, as hyper-parameters, we define an $(n+1)$-dimensional RJMCMC auxiliary vector $(p, \tau)$ where $\tau \sim \text{Unif}(0, \epsilon)$. We also define a function $F: \mathbb{R}^{2n+1} \to \mathbb{R}^{2n+1}$ that maps $(q, p, \tau)$ to $(q', p', \tau')$ as follows (a minimal code sketch is given after the list):

1. **Initial position $\tau$-step:** evolve the position, $q$, for time $\tau$ with fixed momentum, $p$:
   $$q^{[1]} := q + \tau p \tag{4}$$
2. **Initial momentum full-step:**
   $$p^{[1]} := p - \epsilon \nabla U(q^{[1]}) \tag{5}$$
3. **Interchanged position and momentum full-steps:** for $i = 2, \ldots, k$, update the position and momentum vectors interchangeably as follows:
   $$q^{[i]} := q^{[i-1]} + \epsilon\, p^{[i-1]} \tag{6}$$
   $$p^{[i]} := p^{[i-1]} - \epsilon \nabla U(q^{[i]}) \tag{7}$$
   where $k$ is the maximum number of leapfrogs such that the total traversed distance does not exceed $D$:
   $$d^{[k]}(q, p, \tau) \le D < d^{[k+1]}(q, p, \tau), \qquad \text{where } d^{[j]}(q, p, \tau) := \tau\|p\| + \sum_{i=1}^{j-1} \epsilon\,\|p^{[i]}\|, \quad j \in \mathbb{N}. \tag{8}$$
4. **Final position $\tau'$-step:** with momentum $p^{[k]}$, the remaining distance, $d := D - d^{[k]}(q, p, \tau)$, is traversed in time $\tau' := d / \|p^{[k]}\|$, to reach the final position state, $q'$:
   $$q' := q^{[k]} + \tau' p^{[k]}. \tag{9}$$
5. **Negating the final momentum:** $(q', p' := -p^{[k]}, \tau')$ is returned as $F(q, p, \tau)$.
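The following minimal Python sketch (ours, not the authors' reference code) transcribes steps 1-5 literally; it assumes $D > \tau\|p\|$, so that the budget is not exhausted during the initial $\tau$-step:

```python
import numpy as np

def fd_leapfrog(q, p, tau, grad_U, eps, D):
    """Fixed-distance leapfrog map F(q, p, tau) -> (q', p', tau'),
    following steps 1-5 of Section 3.1."""
    # Step 1: initial position tau-step (Eq. 4)
    q = q + tau * p
    d = D - tau * np.linalg.norm(p)          # remaining distance budget
    # Step 2: initial momentum full-step (Eq. 5)
    p = p - eps * grad_U(q)
    # Step 3: interchanged full-steps while a full position step fits (Eqs. 6-8)
    while eps * np.linalg.norm(p) < d:
        q = q + eps * p
        d = d - eps * np.linalg.norm(p)
        p = p - eps * grad_U(q)
    # Step 4: final position tau'-step over the remaining distance (Eq. 9)
    tau_new = d / np.linalg.norm(p)
    q = q + tau_new * p
    # Step 5: negate the final momentum
    return q, -p, tau_new
```

Note that the loop exits as soon as a full $\epsilon$-step would overshoot the budget, so the returned $\tau'$ satisfies $\tau' \le \epsilon$, mirroring condition (8).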
**The relation between FD-leapfrogs and the ordinary leapfrog mechanism.** Traditionally, in leapfrog integration, the position and momentum vectors are updated at steps $i$ and $i+\frac{1}{2}$, respectively. However, by shifting the step indicator by half, we end up with an equally valid leapfrog integrator that updates momenta at $i$ and positions at $i+\frac{1}{2}$, which is called position-Verlet integration (Batcho and Schlick, 2001) (compare with momentum-Verlet (3)):

$$q^{[i+\frac{1}{2}]} = q^{[i]} + \tfrac{\epsilon}{2}\, p^{[i]}, \qquad p^{[i+1]} = p^{[i]} - \epsilon \nabla U(q^{[i+\frac{1}{2}]}), \qquad q^{[i+1]} = q^{[i+\frac{1}{2}]} + \tfrac{\epsilon}{2}\, p^{[i+1]}.$$

Nonetheless, the first position update of the $i$-th leapfrog is the same as the second position update of the $(i-1)$-th leapfrog. Therefore, ordinary position-Verlet leapfrog integration can also be formulated as: (1) an initial position evolution for duration $\epsilon/2$ (half-step), (2) an initial momentum full-step, (3) $(L-1)$ interchanged position and momentum full-steps, (4) a final position half-step, and (5) negation of the final momentum. Our proposed FD-leapfrog algorithm is very similar to position-Verlet, with the major differences being:

1. The durations of the initial and final position evolutions, $\tau$ and $\tau'$, are values in the interval $(0, \epsilon)$ rather than being fixed to $\epsilon/2$; albeit $\mathbb{E}(\tau) = \epsilon/2$, since $\tau \sim \text{Unif}(0, \epsilon)$.
2. The number of interchanged position and momentum full-steps is decided based on the traversed distance rather than being fixed to $(L-1)$.

*Figure 2: Reversibility of FD-leapfrogs: (a) the first 3 steps of the FD-leapfrog mechanism; (b) FD-leapfrogs map $(q, p, \tau)$ to $(q', p', \tau')$; (c) FD-leapfrogs map $(q', p', \tau')$ back to $(q, p, \tau)$.*

**Reversibility.** The reversibility of FD-leapfrogs is established by Lemma 3.1 and illustrated in the example of Figure 2. In the forward evolution (Figure 2-b), an augmented state $(q, p, \tau)$ is mapped to $(q', p', \tau')$ and the total traversed distance is $\tau\|p\| + \sum_{i=1}^{4} \epsilon\|p^{[i]}\| + \tau'\|p^{[5]}\|$ (here, $k = 5$). In Figure 2-c, the evolution starts from $(q', p', \tau')$. It can be verified that $p' = -p^{[5]}$, $p'^{[1]} = -p^{[4]}$, ..., $p'^{[5]} = -p$. At this stage (i.e. $k = 5$), the backward traversed distance is $\tau'\|p'\| + \sum_{i=1}^{4} \epsilon\|p'^{[i]}\|$, which is equal to $\tau'\|p^{[5]}\| + \sum_{i=1}^{4} \epsilon\|p^{[i]}\|$. The remaining distance is $\tau\|p\|$, and with the current momentum, $p'^{[5]} = -p$, it is traversed in time $\tau$. Since $\tau < \epsilon$, at this stage the interchanged full-steps terminate, steps 4 and 5 of the algorithm are run, and $F(q', p', \tau') = (q, p, \tau)$ is returned.

**Lemma 3.1.** *The fixed-distance leapfrog mechanism (FD-leapfrogs) is reversible.*¹

Being provided with a reversible mapping $F(q, p, \tau) = (q', p', \tau')$, an HMC sampler with a fixed-distance evolution mechanism is proposed in Section 4.

¹The formal proof is provided in the supplementary material.
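Lemma 3.1 can also be checked empirically. The snippet below (an illustration that assumes the `fd_leapfrog` sketch above is in scope) applies $F$ twice on a toy quadratic potential and verifies, up to floating-point error, that the original augmented state is recovered:

```python
import numpy as np

def grad_U(q):            # toy potential U(q) = ||q||^2 / 2, so grad U(q) = q
    return q

rng = np.random.default_rng(0)
n, eps, D = 5, 0.1, 3.0
q0   = rng.normal(size=n)
p0   = rng.normal(size=n)
tau0 = rng.uniform(0.0, eps)

q1, p1, tau1 = fd_leapfrog(q0, p0, tau0, grad_U, eps, D)    # forward map F
q2, p2, tau2 = fd_leapfrog(q1, p1, tau1, grad_U, eps, D)    # apply F again
assert np.allclose(q2, q0) and np.allclose(p2, p0) and np.isclose(tau2, tau0)
```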
## 4 Fixed-Distance Hamiltonian Monte Carlo (FDHMC)

To design an RJMCMC sampler based on the mapping $F(q, p, \tau) = (q', p', \tau')$, we need to compute the absolute Jacobian determinant, $|\partial(q', p', \tau')/\partial(q, p, \tau)|$. The Jacobian of a complicated transformation is typically calculated by decomposing the mapping into simpler steps, since the Jacobian determinant of a composition of functions is equal to the product of the Jacobian determinants of its parts. We observe that $F$ can be expressed as a composition of functions of the form $\Upsilon_i(q, p, \zeta) = (q', p', \zeta')$, where $\zeta$ is an auxiliary variable with alternating semantics: $\zeta$ is initially set to the time $\tau$, then represents the remaining distance, $d$, and in the final steps represents the time $\tau'$.

**Lemma 4.1.** *The absolute Jacobian determinant of the fixed-distance leapfrog mechanism (FD-leapfrogs), $F(q, p, \tau) = (q', p', \tau')$, is $\|p\| / \|p'\|$.*

*Proof.* FD-leapfrogs can be reformulated as

$$F(q, p, \tau) = \Upsilon_7 \circ \Upsilon_1 \circ \Upsilon_6 \circ \underbrace{(\Upsilon_3 \circ \Upsilon_5 \circ \Upsilon_4) \circ \cdots \circ (\Upsilon_3 \circ \Upsilon_5 \circ \Upsilon_4)}_{k-1\ \text{times}} \circ\; \Upsilon_3 \circ \Upsilon_2 \circ \Upsilon_1(q, p, \tau),$$

where the mappings $\Upsilon_1$ to $\Upsilon_7$ are defined as follows:

$$\Upsilon_1(q, p, \zeta) := (q + \zeta p,\; p,\; \zeta), \qquad \Upsilon_2(q, p, \zeta) := (q,\; p,\; D - \zeta\|p\|), \qquad \Upsilon_3(q, p, \zeta) := (q,\; p - \epsilon\nabla U(q),\; \zeta),$$
$$\Upsilon_4(q, p, \zeta) := (q + \epsilon p,\; p,\; \zeta), \qquad \Upsilon_5(q, p, \zeta) := (q,\; p,\; \zeta - \epsilon\|p\|), \qquad \Upsilon_6(q, p, \zeta) := (q,\; p,\; \zeta/\|p\|), \qquad \Upsilon_7(q, p, \zeta) := (q,\; -p,\; \zeta).$$

$\Upsilon_1$ is the initial position $\tau$-step (Alg. step 1). $\Upsilon_2$ returns the remaining distance as its last output. $\Upsilon_3$ is a momentum full-step (Alg. step 2). $\Upsilon_4$ is a position full-step. $\Upsilon_5$ updates the remaining distance. As such, the composition $(\Upsilon_3 \circ \Upsilon_5 \circ \Upsilon_4)$ stands for an interchanged position and momentum full-step, along with updating the remaining distance, $d$ (Alg. step 3), and is applied $(k-1)$ times to generate the output $(q^{[k]}, p^{[k]}, d)$. $\Upsilon_6$ computes the remaining time, $\tau' = d/\|p^{[k]}\|$. The final position $\tau'$-step (Alg. step 4) is carried out by applying $\Upsilon_1$ once more. Finally, the negation of the final momentum (Alg. step 5) happens in $\Upsilon_7$. Except for $\Upsilon_2$ and $\Upsilon_6$, all the above transformations preserve volume (because they are either shear mappings or a negation). As such,

$$\left|\frac{\partial F(q, p, \tau)}{\partial(q, p, \tau)}\right| = \left|\frac{\partial \Upsilon_6(q, p, \zeta)}{\partial(q, p, \zeta)}\right|_{(q,p,\zeta)=(q^{[k]},\,p^{[k]},\,d)} \cdot \left|\frac{\partial \Upsilon_2(q, p, \zeta)}{\partial(q, p, \zeta)}\right|_{(q,p,\zeta)=(q^{[1]},\,p,\,\tau)} = \frac{\|p\|}{\|p^{[k]}\|} = \frac{\|p\|}{\|p'\|}. \tag{10}$$

∎
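Lemma 4.1 lends itself to a numerical sanity check. The sketch below (our illustration; it again assumes `fd_leapfrog` from the earlier sketch is in scope) estimates the Jacobian of $F$ by central finite differences and compares its absolute determinant with $\|p\|/\|p'\|$; the estimate is only meaningful when the perturbations do not change the leapfrog count $k$:

```python
import numpy as np

def F_flat(z, n, grad_U, eps, D):
    """F as a map R^{2n+1} -> R^{2n+1} on a flattened (q, p, tau) vector."""
    q, p, tau = z[:n], z[n:2 * n], z[2 * n]
    q1, p1, tau1 = fd_leapfrog(q, p, tau, grad_U, eps, D)
    return np.concatenate([q1, p1, [tau1]])

def grad_U(q):                                # toy potential U(q) = ||q||^2 / 2
    return q

n, eps, D, h = 4, 0.05, 2.0, 1e-6
rng = np.random.default_rng(1)
z = np.concatenate([rng.normal(size=n), rng.normal(size=n),
                    [rng.uniform(0.0, eps)]])

J = np.empty((2 * n + 1, 2 * n + 1))
for i in range(2 * n + 1):                    # central finite differences
    e = np.zeros(2 * n + 1)
    e[i] = h
    J[:, i] = (F_flat(z + e, n, grad_U, eps, D)
               - F_flat(z - e, n, grad_U, eps, D)) / (2 * h)

p_norm     = np.linalg.norm(z[n:2 * n])
p_new_norm = np.linalg.norm(F_flat(z, n, grad_U, eps, D)[n:2 * n])
print(abs(np.linalg.det(J)), p_norm / p_new_norm)   # should approximately agree
```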
### 4.1 Marginal momentum distribution

If we assume that FDHMC has the same momentum distribution as the existing HMC, i.e. $\pi_P(p) := \mathcal{N}(0, I)$, then substituting (10) into (2), FDHMC's acceptance probability becomes

$$\alpha\big((q, p, \tau) \to (q', p', \tau')\big) = \min\left\{1,\; \frac{\pi_Q(q')\exp(-\frac{1}{2}\|p'\|^2)\,\text{Unif}(\tau'; 0, \epsilon)}{\pi_Q(q)\exp(-\frac{1}{2}\|p\|^2)\,\text{Unif}(\tau; 0, \epsilon)} \cdot \frac{\|p\|}{\|p'\|}\right\}.$$

If the Hamiltonian dynamics is (roughly) exact, the above quantity is (almost) equal to $\min\{1, \|p\|/\|p'\|\}$. Depending on the magnitudes of the current and proposed momenta, this acceptance probability can be quite low, which is undesirable. Nonetheless, the interesting point that we would like to highlight is that $\alpha(\cdot)$ tends to reject proposals whose target probability is higher than the target probability of the current state, but always accepts the less probable proposals.² This rather counter-intuitive behaviour can be justified as follows. The role of the acceptance probability in the theory of MCMC is to counterbalance any bias that may exist in the proposal generation mechanism and to guarantee that the draws are proportional to the target distribution. For example, the proposal generation of a Random Walk Metropolis-Hastings (RWMH) sampler (Metropolis et al., 1953) is totally independent of the target density; as such, its corresponding acceptance probability, $\min\{1, \pi_Q(q')/\pi_Q(q)\}$, favours the more probable proposals. The reason that the acceptance probability of HMC (with exact dynamics) is 1 is that its proposal generation mechanism is already unbiased. To be more precise, the bias towards low-probability states (caused by Limitation 1) is cancelled out by the bias towards high-probability states (caused by Limitation 2). By fixing the distance rather than the time, we have resolved the first bias. As such, the effect of the second bias prevails and is counterbalanced by $\alpha(\cdot)$. By choosing the momentum density

$$\pi_P(p) \propto \|p\| \exp(-\|p\|^2/2), \qquad \text{(FDHMC momentum density)} \tag{11}$$

the desirable acceptance probability of the existing HMC is re-established:

$$\alpha_{\text{FDHMC}}\big((q, p, \tau) \to (q', p', \tau')\big) = \min\left\{1,\; \frac{\pi_Q(q')\,\pi_P(p')\,\text{Unif}(\tau'; 0, \epsilon)}{\pi_Q(q)\,\pi_P(p)\,\text{Unif}(\tau; 0, \epsilon)} \left|\frac{\partial(q', p', \tau')}{\partial(q, p, \tau)}\right|\right\}$$
$$\overset{(10),\,(11)}{=} \min\left\{1,\; \frac{\pi_Q(q')\,\|p'\|\exp(-\|p'\|^2/2)}{\pi_Q(q)\,\|p\|\exp(-\|p\|^2/2)} \cdot \frac{\|p\|}{\|p'\|}\right\} = \min\left\{1,\; \frac{\pi_Q(q')\exp(-\|p'\|^2/2)}{\pi_Q(q)\exp(-\|p\|^2/2)}\right\}. \tag{12}$$

²If $\pi_Q(q') > \pi_Q(q)$, then the phase state has moved from a hill on the potential energy function into a valley and has therefore gained momentum. Hence $\|p'\| > \|p\|$, which leads to an acceptance probability $\alpha(\cdot) < 1$. By a similar argument, if $\pi_Q(q') < \pi_Q(q)$, then $\|p'\| < \|p\|$ and $\alpha(\cdot) = 1$.

*Figure 3: The momentum density function, $\pi_P(p)$, of FDHMC versus the Gaussian momentum density function of the existing HMC, associated with (a) a 1D target space and (b) & (c) 2D target spaces.*

Figure 3 shows that the momentum density of FDHMC is strongly inclined towards generating momenta with higher magnitudes. In fact, for FDHMC, $\pi_P(0) = 0$, while for HMC, $\pi_P(0)$ is the peak of the density function. This means that FDHMC does not suffer from the second bias of HMC (i.e. Limitation 2) either, because with momenta that (on average) have higher magnitudes, the state will not easily be trapped in the valleys of the potential energy function.

### 4.2 Sampling the initial momentum

We need a fast way to directly draw a momentum (per FDHMC sample) from the distribution, $\pi_P(p)$, defined by (11). We note that even though the support of this function is $\mathbb{R}^n$, it is totally symmetric with respect to the angular coordinates and is only a function of the magnitude, $\|p\|$ (see Figure 3). Lemma 4.2 (stated and proved below) implies that the momentum magnitude, $m := \|p\|$, is distributed as

$$\pi_M(m) \propto m \exp(-m^2/2) \cdot m^{n-1} = \chi(m;\, n+1),$$

where $\chi(m; n+1) \propto m^n \exp(-m^2/2)$ is a Chi distribution defined on the support $m \in [0, \infty)$ with $n+1$ degrees of freedom. As such, the process of drawing a sample $p^{(c)}$ from $\pi_P(p) \propto \|p\| \exp(-\frac{1}{2}\|p\|^2)$ is as follows (a code sketch follows the list):

1. Draw a direction vector, $v \in \mathbb{R}^n$, from a uniform $n$-sphere, e.g. by normalising a draw from a normal distribution: $v := s/\|s\|$ where $s \sim \mathcal{N}(\mathbf{0}_n, I_{n\times n})$.
2. Draw a magnitude, $m$, from a Chi distribution $\chi(m; n+1)$.
3. Generate $p^{(c)}$ by scaling the drawn direction by the drawn magnitude: $p^{(c)} = m \cdot v$.
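In code, the three steps above amount to a few lines (a sketch under the stated distributional facts; a $\chi(n+1)$ draw is the square root of a chi-square draw with $n+1$ degrees of freedom):

```python
import numpy as np

def sample_momentum(n, rng):
    """Draw p ~ pi_P(p) proportional to ||p|| exp(-||p||^2 / 2)  (Eq. 11)."""
    s = rng.normal(size=n)
    v = s / np.linalg.norm(s)               # step 1: uniform direction on the n-sphere
    m = np.sqrt(rng.chisquare(df=n + 1))    # step 2: Chi(n+1) magnitude
    return m * v                            # step 3: scale direction by magnitude
```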
```text
Algorithm 1: FDHMC(q(c), U)
Input:  q(c), the current state in X ⊆ R^n;
        U : X -> R, potential energy function (negative log of the target density, pi_Q).
Hyperparameters: D in R, total evolution distance; eps in R, leapfrog step size.

// The next two lines are equivalent to p(c) ~ pi_P(p) ∝ ||p|| exp(-||p||^2 / 2):
s ~ N(0_n, I_nxn);  m ~ chi(m; n+1)            // chi(m; n+1) ∝ m^n exp(-m^2/2)
p(c) <- m * s / ||s||
tau(c) ~ Unif(0, eps)                          // duration of the initial fixed-momentum evolution
q <- q(c) + tau(c) * p(c)                      // initial evolution of the position for time tau(c)
d <- D - tau(c) * ||p(c)||                     // the remaining distance
p <- p(c) - eps * grad U(q)                    // initial momentum full-step
while eps * ||p|| < d do                       // while a position full-step fits within d
    q <- q + eps * p                           // position full-step
    d <- d - eps * ||p||                       // update the remaining distance
    p <- p - eps * grad U(q)                   // momentum full-step
tau(*) <- d / ||p||;  q(*) <- q + tau(*) * p;  p(*) <- p      // construct the final proposal
if u ~ Unif(0, 1) < [pi_Q(q(*)) exp(-||p(*)||^2/2)] / [pi_Q(q(c)) exp(-||p(c)||^2/2)] then
    return q(*)
else
    return q(c)
```

**Lemma 4.2.** *If a density function, $\pi_X(x) := \pi_X(x_1, \ldots, x_n)$ (on $\mathbb{R}^n$), is only a function of the magnitude, $\|x\|$, then the random variable $m = \|x\|$ is distributed with density $\pi_M(m) \propto m^{n-1}\, g(m)$, where $g(m) := \pi_X([m, 0, \ldots, 0]^\top) = \pi_X([0, m, 0, \ldots, 0]^\top) = \cdots = \pi_X([0, \ldots, 0, m]^\top)$.*

*Proof.* The CDF, $\Pi_M(a)$, i.e. the probability that $m < a$, is

$$\Pi_M(a) := \int_{m=0}^{a} \pi_M(m)\, dm. \tag{13}$$

This is equivalent to integrating the density function, $\pi_X(x)$, within the hypersphere $\|x\| \le a$:

$$\Pi_M(a) = \int_{x_1} \cdots \int_{x_n} \mathbb{I}[\|x\| \le a]\, \pi_X(x)\, dx \qquad \text{(integration over an $n$-sphere of radius $a$)}$$
$$= \int_{r=0}^{a} \int_{\phi_1=0}^{\pi} \cdots \int_{\phi_{n-1}=0}^{2\pi} g(r)\, r^{n-1} \sin^{n-2}(\phi_1)\sin^{n-3}(\phi_2)\cdots\sin(\phi_{n-2})\; dr\, d\phi_1 \cdots d\phi_{n-1} \qquad \text{(spherical change of coordinates)}$$
$$\propto \int_{r=0}^{a} g(r)\, r^{n-1}\, dr. \tag{14}$$

Comparing (13) and (14) shows that $\pi_M(m) \propto g(m)\, m^{n-1}$. ∎

The complete FDHMC sampler is summarised in Algorithm 1 above; a direct Python transcription is sketched below. In terms of implementation complexity, this algorithm is quite comparable with the baseline HMC. Nonetheless, as our experimental results will show, in terms of effective sample size per gradient (ESS/grad), FDHMC can largely outperform the baseline.
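The transcription below is our sketch, reusing `fd_leapfrog` and `sample_momentum` from the earlier sketches; working with log densities avoids overflow in the acceptance ratio:

```python
import numpy as np

def fdhmc_step(q, U, grad_U, eps, D, rng):
    """One FDHMC transition, following Algorithm 1."""
    p = sample_momentum(q.shape[0], rng)
    tau = rng.uniform(0.0, eps)
    q_new, p_new, _ = fd_leapfrog(q, p, tau, grad_U, eps, D)
    # Acceptance ratio of Eq. (12): the Jacobian ||p||/||p'|| cancels the
    # ||p|| factors of the momentum density, leaving a Gaussian-like ratio.
    log_alpha = (U(q) + 0.5 * p @ p) - (U(q_new) + 0.5 * p_new @ p_new)
    if np.log(rng.uniform()) < log_alpha:
        return q_new
    return q
```

A chain is then produced by iterating `q = fdhmc_step(q, U, grad_U, eps, D, rng)`.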
## 5 Experimental evaluation

We compare the performance of FDHMC against three baselines: (a) static HMC (Neal, 2011), and two versions of the No-U-Turn sampler (NUTS) (Hoffman and Gelman, 2014), namely (b) Dynamic Slice HMC (DSHMC), which is the original implementation (Hoffman and Gelman, 2014), and (c) Dynamic Multinomial HMC (DMHMC) (Betancourt, 2017), which is the current version of NUTS used in Stan (Carpenter et al., 2017).

For tuning HMC, the initial total simulation duration, $\lambda = \epsilon L$, is set to 2.0 and the step size is tuned by dual averaging (Nesterov, 2009; Hoffman and Gelman, 2014). If dual averaging does not converge, we halve the value of $\lambda$ and repeat this process until the convergence conditions are satisfied. For tuning DSHMC and DMHMC, we rely on the Mici probabilistic programming package (Graham, 2019). We tune the parameters of FDHMC with an algorithm that is presented in the supplementary material and works well on our experimental models. It includes tuning the step size, $\epsilon$, via dual averaging, together with the following heuristic for tuning the fixed distance, $D$. We observe that if $D$ is too large, its correlation with the expected distance between successive samples, $\mathbb{E}[\|q^{(i)} - q^{(i-1)}\|]$, is low. As such, we first draw $N = 500$ samples, $q^{(i)}$, from an FDHMC sampler associated with a sufficiently large fixed distance, $D'$. Then we set the final fixed distance to the average jump distance,

$$D := \frac{1}{N-1} \sum_{i=2}^{N} \|q^{(i)} - q^{(i-1)}\|.$$

To decide the initial $D'$, we set $\epsilon' = 1$ and then repeatedly double or halve the value of $\epsilon'$ until the acceptance probability of the Langevin proposal with step size $\epsilon'$ crosses 0.5. Then we set $D' := 10\epsilon'$. To stabilise the performance of this Langevin dynamics, the magnitude of its associated momentum vector, $p$, is fixed to the mean of $\chi(x; n+1)$, i.e. $\sqrt{2}\,\Gamma(n/2 + 1)/\Gamma((n+1)/2)$.

Our experimental models include:

1. $n$-dimensional multivariate normal distributions (MVN), where $n \in \{10, 30, 100, 300\}$ and the covariance matrix is drawn from a Wishart distribution with an identity scale matrix and degrees of freedom equal to the model dimension, $n$.

2. Neal's funnel (Neal, 2003) (FNNL):
   $$\pi_Q(q_1, \ldots, q_n) = \mathcal{N}(q_1;\, 0, \sigma^2) \prod_{i=2}^{n} \mathcal{N}(q_i;\, 0, \exp(k q_1)),$$
   where the dimension $n \in \{5, 10, 50, 100\}$, $\sigma^2 = 1$ and $k = 3$. (A sketch of this model's potential energy and gradient is given after Table 1.)

3. The posterior density function, $\pi_Q(\alpha, \beta \mid X, y)$ (15), associated with Bayesian logistic regression for binary classification, where each data point contains a vector of predictors, $x_i$, and a class label $y_i \in \{-1, 1\}$:
   $$\pi_Q(\alpha, \beta \mid X, y) \propto \exp\left\{-\frac{\alpha^2 + \|\beta\|^2}{2\sigma^2} - \sum_{i=1}^{N} \log\big(1 + \exp(-y_i(\alpha + x_i^\top \beta))\big)\right\} \tag{15}$$
   We set $\sigma^2 = 1$ and apply the model to the following three data sets available from the UCI repository (Frank et al., 2011): (a) the Australian Credit Approval (Aus Cr) data set, where each data point has 14 predictors (hence $n = 15$) and we use only the first 100 data points; (b) the SPECT Heart (SPECT) data set, with 267 data points, each with 22 predictors; and (c) the German Credit (Gr Cr) data set, with 1000 data points, each with 24 predictors. In these three data sets we normalise all predictors to have zero mean and unit variance.

For each sampling algorithm, 50 independent MCMC chains are run, and the average effective sample size per gradient (ESS/grad) ± 95% confidence interval is reported in Table 1. The length of each chain is 1000 samples, drawn after 200 initial burn-in draws. For each chain iteration, all samplers start from the same initial state, and the ESS is approximated following the details described by Hoffman and Gelman (2014). The results show that in 8 out of the 11 experiments FDHMC outperforms the baselines, and in some models the difference is substantial.

| Model  | dim | HMC             | DSHMC             | DMHMC             | FDHMC           |
|--------|-----|-----------------|-------------------|-------------------|-----------------|
| MVN    | 10  | 0.0174 ± 0.0026 | 0.0318 ± 0.0007   | 0.0308 ± 0.0005   | 0.0252 ± 0.0036 |
| MVN    | 30  | 0.0006 ± 0.0002 | 0.0012 ± 0.0001   | 0.0012 ± 9.80e−5  | 0.0031 ± 0.0025 |
| MVN    | 100 | 0.0002 ± 0.0001 | 0.0002 ± 2.29e−5  | 0.0002 ± 2.39e−5  | 0.0591 ± 0.0123 |
| MVN    | 300 | 0.0069 ± 0.0031 | 0.0017 ± 8.39e−6  | 0.0015 ± 6.78e−6  | 0.0629 ± 0.0069 |
| FNNL   | 5   | 0.0084 ± 0.0023 | 0.0006 ± 0.0001   | 0.0011 ± 0.0002   | 0.0080 ± 0.0026 |
| FNNL   | 10  | 0.0027 ± 0.0009 | 0.0005 ± 0.0001   | 0.0005 ± 0.0001   | 0.0030 ± 0.0012 |
| FNNL   | 50  | 0.0007 ± 0.0005 | 0.0001 ± 5.72e−5  | 8.18e−5 ± 2.32e−5 | 0.0008 ± 0.0003 |
| FNNL   | 100 | 0.0009 ± 0.0005 | 8.36e−5 ± 3.61e−5 | 9.11e−5 ± 3.87e−5 | 0.0045 ± 0.0078 |
| Aus Cr | 15  | 0.0132 ± 0.0010 | 0.0286 ± 0.0011   | 0.0276 ± 0.0008   | 0.0241 ± 0.0077 |
| SPECT  | 23  | 0.0122 ± 0.0008 | 0.0190 ± 0.0004   | 0.0254 ± 0.0006   | 0.0259 ± 0.0013 |
| Gr Cr  | 25  | 0.0003 ± 0.0001 | 0.0225 ± 0.0009   | 0.0241 ± 0.0007   | 0.0252 ± 0.0016 |

*Table 1: Comparing FDHMC with HMC, DSHMC and DMHMC, based on effective sample size per gradient, approximated via 50 independent MCMC chains, each of 1000 states.*
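As referenced in the model list above, the funnel's potential energy and gradient have a simple closed form. The sketch below is our own derivation from the stated density, under the assumption that the second parameter of $\mathcal{N}(\cdot\,; 0, \cdot)$ denotes a variance; these two functions can be passed directly to the `fdhmc_step` sketch of Section 4:

```python
import numpy as np

K, SIGMA2 = 3.0, 1.0          # k and sigma^2 as set in Section 5

def funnel_U(q):
    """U(q) = -log pi_Q(q) for Neal's funnel, up to an additive constant."""
    q1, rest = q[0], q[1:]
    return (q1 ** 2 / (2.0 * SIGMA2)
            + 0.5 * K * q1 * rest.size            # from the log-variance terms
            + 0.5 * np.exp(-K * q1) * (rest @ rest))

def funnel_grad_U(q):
    q1, rest = q[0], q[1:]
    g = np.empty_like(q)
    g[0] = (q1 / SIGMA2 + 0.5 * K * rest.size
            - 0.5 * K * np.exp(-K * q1) * (rest @ rest))
    g[1:] = np.exp(-K * q1) * rest
    return g
```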
## 6 Conclusion

The main insight of the present work is that the fixed-time evolution of a phase state via the equations of motion is intrinsically biased towards generating proposals that have low target probability (Limitation 1). In HMC, this bias is compensated by drawing momenta from a normal distribution centred at 0. Many such sampled momentum vectors have small magnitudes and do not let the state leave the high-probability regions. This causes a (second) bias that compensates the first bias. Nevertheless, this counterbalancing is obtained at the expense of more correlation between the successive states that are drawn from the high-probability regions (Limitation 2).

In the light of RJMCMC theory, we modified the core of the HMC algorithm and designed a sampler where the state is evolved for a fixed traversed distance rather than a fixed evolution time. This dynamics is not biased towards proposals that have low target probability, and it is coupled with a momentum distribution that generates vectors with a higher expected magnitude. According to our experimental results, this can lead to better exploration of the space and a higher effective sample size per gradient, while maintaining the high proposal acceptance probability of the baseline HMC. The proposed FDHMC can be implemented efficiently and, with appropriate modifications, can be combined with many variations of HMC. More specifically, even though for simplicity we focused throughout on combining FD-leapfrogs with the baseline HMC, this mechanism can be combined with any variation of HMC that relies on Störmer-Verlet leapfrog integration, e.g. the case where the mass matrix is not the identity. Studying the performance of such variations of FDHMC, as well as alternative methods to tune its hyper-parameters, can be pursued in a future line of research.

## Acknowledgments and Disclosure of Funding

This research has been supported by the National Health and Medical Research Council (NHMRC), Research Grant GNT1149976. PURL: http://purl.org/au-research/grants/nhmrc/GNT1149976.

## References

Hadi Mohasel Afshar, Rafael Oliveira, and Sally Cripps. Non-volume preserving Hamiltonian Monte Carlo and No-U-Turn samplers. In *International Conference on Artificial Intelligence and Statistics*, pages 1675-1683, 2021.

Paul F. Batcho and Tamar Schlick. Special stability advantages of position-Verlet over velocity-Verlet in multiple-time step integration. *The Journal of Chemical Physics*, 115(9):4019-4029, 2001.

Michael Betancourt. A conceptual introduction to Hamiltonian Monte Carlo. *arXiv preprint arXiv:1701.02434*, 2017.

Bob Carpenter, Andrew Gelman, Matthew D. Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language. *Journal of Statistical Software*, 76(1), 2017.

Siddhartha Chib and Edward Greenberg. Understanding the Metropolis-Hastings algorithm. *The American Statistician*, 49(4):327-335, 1995.

Simon Duane, Anthony D. Kennedy, Brian J. Pendleton, and Duncan Roweth. Hybrid Monte Carlo. *Physics Letters B*, 195(2):216-222, 1987.

Andrew Frank, Arthur Asuncion, et al. UCI machine learning repository, 2011. URL http://archive.ics.uci.edu/ml.

Matthew Graham. Mici: Markov chain Monte Carlo (MCMC) methods for Python, 2019. URL https://git.io/mici.py.

Peter J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. *Biometrika*, 82(4):711-732, 1995.

Matthew D. Hoffman and Andrew Gelman. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. *Journal of Machine Learning Research*, 15(1):1593-1623, 2014.

Daniel Levy, Matthew D. Hoffman, and Jascha Sohl-Dickstein. Generalizing Hamiltonian Monte Carlo with neural networks. In *International Conference on Learning Representations (ICLR)*, 2018.

Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. Equation of state calculations by fast computing machines. *The Journal of Chemical Physics*, 21(6):1087-1092, 1953.
Radford M. Neal. Slice sampling. *Annals of Statistics*, pages 705-741, 2003.

Radford M. Neal. MCMC using Hamiltonian dynamics. In Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng, editors, *Handbook of Markov Chain Monte Carlo*, pages 113-162. Chapman & Hall, 2011.

Yurii Nesterov. Primal-dual subgradient methods for convex problems. *Mathematical Programming*, 120(1):221-259, 2009.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] In Section 5 we mention that there is room for improving the hyper-parameter tuning of the algorithm; this can be a future direction of research.
   (c) Did you discuss any potential negative societal impacts of your work? [No] This work is mainly theoretical and improves an existing general-purpose inference tool. It does not have any positive or negative societal impact on its own.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No] We reported the effective sample size per gradient, which does not depend on the resources used.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] We cited the UCI repository.
   (b) Did you mention the license of the assets? [No] No license is required.
   (c) Did you include any new assets either in the supplemental material or as a URL? [No] No new assets (other than the provided code).
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A] We do not use such data.
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A] We do not use such data.
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] Not applicable to this work.
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] Not applicable to this work.
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A] Not applicable to this work.