Entropy-based adaptive Hamiltonian Monte Carlo

Marcel Hirt
Department of Statistical Science, University College London, UK
marcel.hirt.16@ucl.ac.uk

Michalis K. Titsias
DeepMind, London, UK
mtitsias@google.com

Petros Dellaportas
Department of Statistical Science, University College London, UK;
Department of Statistics, Athens University of Economics and Business, Greece;
and The Alan Turing Institute, UK

Abstract

Hamiltonian Monte Carlo (HMC) is a popular Markov chain Monte Carlo (MCMC) algorithm to sample from an unnormalized probability distribution. A leapfrog integrator is commonly used to implement HMC in practice, but its performance can be sensitive to the choice of mass matrix used therein. We develop a gradient-based algorithm that allows for the adaptation of the mass matrix by encouraging the leapfrog integrator to have high acceptance rates while also exploring all dimensions jointly. In contrast to previous work that adapts the hyperparameters of HMC using some form of expected squared jumping distance, the adaptation strategy suggested here aims to increase sampling efficiency by maximizing an approximation of the proposal entropy. We illustrate that using multiple gradients in the HMC proposal can be beneficial compared to a single gradient step in Metropolis-adjusted Langevin proposals. Empirical evidence suggests that the adaptation method can outperform different versions of HMC schemes by adjusting the mass matrix to the geometry of the target distribution and by providing some control on the integration time.

1 Introduction

Consider the problem of sampling from a target density $\pi$ on $\mathbb{R}^d$ of the form $\pi(q) \propto e^{-U(q)}$, with a potential energy $U : \mathbb{R}^d \to \mathbb{R}$ that is twice continuously differentiable. HMC methods [20, 46, 9] sample from a Boltzmann-Gibbs distribution $\mu(q, p) \propto e^{-H(q,p)}$ on the phase space $\mathbb{R}^{2d}$, based on the (separable) Hamiltonian function $H(q, p) = U(q) + K(p)$ with $K(p) = \frac{1}{2} p^\top M^{-1} p$. The Hamiltonian represents the total energy, split into a potential energy term $U$ and a kinetic energy $K$, which we assume to be Gaussian for some symmetric positive definite mass matrix $M$. Suppose that $(q(t), p(t))_{t \in \mathbb{R}}$ evolve according to the differential equations
$$\frac{\mathrm{d}q(t)}{\mathrm{d}t} = \frac{\partial H(q(t), p(t))}{\partial p} = M^{-1} p(t) \quad \text{and} \quad \frac{\mathrm{d}p(t)}{\mathrm{d}t} = -\frac{\partial H(q(t), p(t))}{\partial q} = -\nabla U(q(t)). \quad (1)$$
Let $(\varphi_t)_{t \ge 0}$ denote the flow of the Hamiltonian system, that is, for fixed $t$, $\varphi_t$ maps each $(q, p)$ to the solution of (1) that takes value $(q, p)$ at time $t = 0$. The exact HMC flow $\varphi$ preserves volume and conserves the total energy, i.e. $H \circ \varphi_t = H$. Consequently, the Boltzmann-Gibbs distribution $\mu$ is invariant under the Hamiltonian flow, that is, $\mu(\varphi_t(E)) = \mu(E)$ for any Borel set $E \subset \mathbb{R}^{2d}$. Furthermore, the flow satisfies the generalized reversibility condition $F \circ \varphi_t = \varphi_{-t} \circ F$ with the flip operator $F(q, p) = (q, -p)$; put differently, the Hamiltonian dynamics go backward in time by negating the velocity. If an analytical expression for the exact flow were available, one could sample from $\mu$ using the $\mu$-invariant Markov chain that, at state $(q, p)$, first draws a new velocity $p' \sim N(0, M)$ and then sets the next state to $\varphi_T(q, p')$ for some integration time $T > 0$. Such a velocity refreshment is necessary since the HMC dynamics preserve the energy and therefore cannot be ergodic. However, the Hamiltonian flow cannot be computed exactly, except for very special potential functions.
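One such special case is a Gaussian target, for which the exact flow is a rotation in phase space. The following minimal sketch (our illustration, not code from the paper) checks energy conservation and generalized reversibility of this exact flow numerically for a one-dimensional Gaussian with standard deviation `sigma` and unit mass matrix:

```python
# Exact Hamiltonian flow for U(q) = q^2 / (2 sigma^2) with M = I: a rotation.
# It conserves H and satisfies the reversibility condition F o phi_t = phi_{-t} o F.
import numpy as np

sigma = 2.0                       # target standard deviation
omega = 1.0 / sigma               # angular frequency of the harmonic oscillator

def H(q, p):
    return 0.5 * (q / sigma) ** 2 + 0.5 * p ** 2

def flow(q, p, t):
    # Exact solution of dq/dt = p, dp/dt = -q / sigma^2.
    c, s = np.cos(omega * t), np.sin(omega * t)
    return c * q + sigma * s * p, -s * q / sigma + c * p

q0, p0, t = 1.3, -0.7, 0.9
qt, pt = flow(q0, p0, t)
print("energy conserved:", np.isclose(H(q0, p0), H(qt, pt)))
# Flipping the momentum and flowing again returns the flipped initial state.
qb, pb = flow(qt, -pt, t)
print("reversible:", np.isclose(qb, q0) and np.isclose(pb, -p0))
```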
Numerical approximations to the exact solution of Hamilton's equations are thus routinely used, most commonly the leapfrog method, also known as the (velocity) Verlet integrator [28, 10]. For a step size $h > 0$ and $L$ steps, the algorithm updates the previous state $q_0$ and a new velocity $p_0 \sim N(0, M)$ by setting, for $0 \le \ell \le L - 1$,
$$p_{\ell + \frac12} = p_\ell - \frac{h}{2} \nabla U(q_\ell); \qquad q_{\ell+1} = q_\ell + h M^{-1} p_{\ell + \frac12}; \qquad p_{\ell+1} = p_{\ell + \frac12} - \frac{h}{2} \nabla U(q_{\ell+1}). \quad (2)$$
This scheme can be motivated by splitting the Hamiltonian: the kick mappings in the first and third steps update only the momentum, while the drift mapping in the second step advances only the position $q$ with constant speed. For $T = Lh$, the leapfrog integrator approximates $\varphi_T(q_0, p_0)$ by $(q_L, p_L)$ while also preserving some geometric properties of $\varphi$, namely volume preservation and generalized reversibility. The leapfrog method is a second-order integrator, making an $O(h^2)$ energy error $H(q_L, p_L) - H(q_0, p_0)$. A $\mu$-invariant Markov chain can be constructed using a Metropolis-Hastings acceptance step. More concretely, the proposed state $(q_L, p_L)$ is accepted with probability $a(q_0, p_0) = \min\{1, \exp[-(H(q_L, p_L) - H(q_0, p_0))]\}$, while the next state is set to $F(q_0, p_0)$ in case of rejection, although the velocity flip is inconsequential for full refreshment strategies.

We want to explore further the generalised speed measure introduced in [54] for adapting random walk Metropolis (RWM) or Metropolis-adjusted Langevin (MALA) algorithms, which aims to achieve fast convergence by constructing proposals that (i) have a high average log-acceptance rate and (ii) have a high entropy. Whereas the entropy of the proposal in RWM or MALA algorithms can be evaluated efficiently, the multi-step nature of HMC trajectories makes this computation less tractable. The recent work in [41] considers the same adaptation objective by learning a normalising flow that is inspired by a leapfrog proposal with a more tractable entropy, obtained by masking components in a leapfrog-style update via an affine coupling layer as used for RealNVPs [19]. [60] sets the integration time by maximizing the proposal entropy for the exact HMC flow in Gaussian targets, while choosing the mass matrix to be the inverse of the sample covariance matrix.

2 Related work

The choice of the hyperparameters $h$, $L$ and $M$ can have a large impact on the efficiency of the sampler. For fixed $L$ and $M$, a popular approach for adapting $h$ is to target an acceptance rate of around 0.65, which is optimal for iid Gaussian targets in the limit $d \to \infty$ [8] for a given integration time. HMC hyperparameters have been tuned using some form of expected squared jumping distance (ESJD) [49], for instance via Bayesian optimization [56] or a gradient-based approach [40]. A popular approach suggested in [32] tunes $L$ based on the ESJD by doubling $L$ until the path makes a U-turn and retraces back towards the starting point, that is, by stopping to increase $L$ when the distance to the proposed state reaches a stationary point [4]; see also [57] for a variation and [48] for a version using sequential proposals. Modern probabilistic programming languages such as Stan [12], PyMC3 [51], Turing [23, 58] or TFP [39] furthermore allow for the adaptation of a diagonal or dense mass matrix within NUTS based on the sample covariance matrix. The Riemann manifold HMC algorithm from [25] uses a position-dependent mass matrix $M(x)$ based on a non-separable Hamiltonian, but can be computationally expensive, requiring $O(d^3)$ operations in general.
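For concreteness, the following minimal sketch implements the leapfrog update (2) together with the Metropolis-Hastings correction from Section 1; `U`, `grad_U` and the mass matrix are generic placeholders rather than choices made in the paper.

```python
# Leapfrog integration (2) plus Metropolis-Hastings accept/reject for HMC.
import numpy as np

def leapfrog(q, p, grad_U, h, L, Minv):
    q, p = q.copy(), p.copy()
    for _ in range(L):
        p = p - 0.5 * h * grad_U(q)      # half kick
        q = q + h * Minv @ p             # drift
        p = p - 0.5 * h * grad_U(q)      # half kick
    return q, p

def hmc_step(q, U, grad_U, h, L, M, Minv, rng):
    p = rng.multivariate_normal(np.zeros(len(q)), M)   # full velocity refreshment
    qL, pL = leapfrog(q, p, grad_U, h, L, Minv)
    dH = (U(qL) + 0.5 * pL @ Minv @ pL) - (U(q) + 0.5 * p @ Minv @ p)
    if rng.uniform() < np.exp(min(0.0, -dH)):          # accept with prob min(1, e^{-dH})
        return qL
    return q          # momentum flip is irrelevant under full refreshment

# Example: standard Gaussian target with identity mass matrix.
rng = np.random.default_rng(0)
M = np.eye(2)
U = lambda q: 0.5 * q @ q
grad_U = lambda q: q
q = np.zeros(2)
for _ in range(100):
    q = hmc_step(q, U, grad_U, 0.5, 5, M, M, rng)
print(q)
```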
An alternative way to choose $M$, or more generally the kinetic energy $K$, was proposed in [43] by analysing the behaviour of $x \mapsto \nabla K(\nabla U(x))$. Different preconditioning approaches have been compared for Gaussian targets in [38]. A popular route has also been to first transform the target using tools from variational inference, as in [31], and then run an HMC sampler with unit mass matrix on the transformed density with a more favourable geometry.

A common setting to study the convergence of HMC assumes a log-concave target. In the case that $U$ is $m_1$-strongly convex and $m_2$-smooth, [45, 15] analyse the ideal HMC algorithm with unit mass matrix, where a higher condition number $\kappa = m_2/m_1$ implies slower mixing: the relaxation time, i.e. the inverse of the spectral gap, grows linearly in $\kappa$, assuming the integration time is set to $T = \frac{1}{2\sqrt{m_2}}$. [14] establish non-asymptotic upper bounds on the mixing time using a leapfrog integrator where the step size $h$ and the number $L$ of steps depend explicitly on $m_1$ and $m_2$. Convergence guarantees are established using conductance profiles by obtaining (i) a high-probability lower bound on the acceptance rate and (ii) an overlap bound, that is, an upper bound on the KL-divergence between the HMC proposal densities at the starting positions $q_0$ and $q_0'$, whenever $q_0$ is close to $q_0'$. While such bounds for controlling the mixing time might share some similarity with the generalised speed measure (GSM) considered here, they do not lend themselves easily to a gradient-based adaptation.

3 Entropy-based adaptation scheme

We derive a novel method to approximate the entropy of the proposed position after $L$ leapfrog steps. Our approximation is based on the assumption that the Hessian of the target is locally constant around the mid-point of the HMC trajectory, which allows for a stochastic trace estimator of the marginal proposal entropy. We develop a penalised loss function that can be minimized using stochastic gradient descent while sampling from the Markov chain, in order to optimize a GSM.

3.1 Marginal proposal entropy

Suppose that $CC^\top = M^{-1}$, where $C$ is defined by some parameters $\theta$ and can be a diagonal matrix, a full Cholesky factor, etc. Without loss of generality, the step size $h > 0$ can be fixed. We can reparameterize the momentum resampling step $p_0 \sim N(0, M)$ by sampling $v \sim N(0, I)$ and setting $p_0 = C^{-\top} v$. One can show by induction, cf. Appendix E for details, that the $L$-th step position $q_L$ and momentum $p_L$ of the leapfrog integrator can be represented as a function of $v$ via
$$q_L = T_L(v) = q_0 - \frac{L h^2}{2} M^{-1} \nabla U(q_0) + L h\, C v - h^2 M^{-1} \Xi_L(v), \quad (3)$$
$$p_L = W_L(v) = C^{-\top} v - \frac{h}{2} \left[ \nabla U(q_0) + \nabla U(T_L(v)) \right] - h \sum_{i=1}^{L-1} \nabla U(T_i(v)), \quad (4)$$
$$\Xi_L(v) = \sum_{i=1}^{L-1} (L - i)\, \nabla U(T_i(v)), \quad (5)$$
see also [42, 21, 14] for the special case with an identity mass matrix. Observe that for $L = 1$ leapfrog steps, this reduces to a MALA proposal with preconditioning matrix $M^{-1}$. Under regularity conditions, see for instance [21], the transformation $T_L : \mathbb{R}^d \to \mathbb{R}^d$ is a $C^1$-diffeomorphism. With $\nu$ denoting the standard Gaussian density, the density $r_L$ of the HMC proposal for the position $q_L$ after $L$ leapfrog steps is the pushforward density of $\nu$ via the map $T_L$, so that
$$\log r_L(T_L(v)) = \log \nu(v) - \log |\det DT_L(v)|. \quad (6)$$
(We denote the Jacobian matrix of a function $f : \mathbb{R}^d \to \mathbb{R}^d$ at the point $x$ by $Df(x)$.) Observe that the density depends on the Jacobian of the transformation $T_L : v \mapsto q_L$. We would like to avoid computing $\log |\det DT_L(v)|$ exactly. Define the residual transformation
$$S_L : \mathbb{R}^d \to \mathbb{R}^d, \quad v \mapsto \frac{1}{Lh} C^{-1} T_L(v) - v. \quad (7)$$
Then $DT_L(v) = Lh\, C (I + DS_L(v))$ and consequently
$$\log |\det DT_L(v)| = d \log(Lh) + \log |\det C| + \log |\det(I + DS_L(v))|. \quad (8)$$
Combining (6) and (8) yields the log-probability of the HMC proposal,
$$\log r_L(T_L(v)) = \log \nu(v) - d \log(Lh) - \log |\det C| - \log |\det(I + DS_L(v))|. \quad (9)$$
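The representation (3)-(5) can be checked numerically: running the leapfrog integrator with $p_0 = C^{-\top} v$ and accumulating $\Xi_L(v)$ from the intermediate positions should reproduce $q_L$ exactly. A minimal sketch under our own toy choices (a quadratic potential and an arbitrary triangular factor $C$), not code from the paper:

```python
# Check that the closed-form map T_L(v) of (3), with Xi_L from (5), matches the
# leapfrog position q_L for U(q) = q' A q / 2 and M^{-1} = C C'.
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[2.0, 0.3], [0.3, 1.0]])          # Hessian of the quadratic potential
grad_U = lambda q: A @ q
C = np.array([[0.8, 0.0], [0.2, 0.5]])          # preconditioning factor
Minv = C @ C.T
h, L = 0.1, 5

q0 = rng.standard_normal(2)
v = rng.standard_normal(2)
p = np.linalg.solve(C.T, v)                     # p_0 = C^{-T} v, so p_0 ~ N(0, M)

# Run the leapfrog integrator, keeping all intermediate positions T_1, ..., T_L.
q, pos = q0.copy(), []
for _ in range(L):
    p = p - 0.5 * h * grad_U(q)
    q = q + h * Minv @ p
    p = p - 0.5 * h * grad_U(q)
    pos.append(q.copy())

Xi = sum((L - i) * grad_U(pos[i - 1]) for i in range(1, L))      # equation (5)
TL = q0 - 0.5 * L * h**2 * Minv @ grad_U(q0) + L * h * C @ v - h**2 * Minv @ Xi
print(np.allclose(TL, pos[-1]))                 # True: (3) reproduces q_L
```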
Comparing equations (3) and (7), one sees that $S_L(v) = c - \frac{h}{L} C^\top \Xi_L(v)$ for some constant $c \in \mathbb{R}^d$ that depends on $\theta$ but is independent of $v$; consequently, $DS_L(v) = -\frac{h}{L} C^\top D\Xi_L(v)$. We next show a recursive expression for $DS_L$, with a proof given in Appendix B.

Lemma 1 (Jacobian representation). It holds that $DS_1 = 0$ and, for any $\ell \in \{2, \ldots, L\}$ and $v \in \mathbb{R}^d$,
$$DS_\ell(v) = -h^2 \sum_{i=1}^{\ell-1} \frac{(\ell - i)\, i}{\ell}\, C^\top \nabla^2 U(T_i(v))\, C\, (I + DS_i(v)). \quad (10)$$
In particular, $DS_\ell(v)$ is a symmetric matrix. Suppose further that $L^2 h^2 \sup_{q \in \mathbb{R}^d} \| C^\top \nabla^2 U(q)\, C \|_2 < \frac{1}{4}$. Then, for any $\ell \in \{1, \ldots, L\}$ and $v \in \mathbb{R}^d$, we have $\| DS_\ell(v) \|_2 < 1$.

Notice that the recursive formula (10) requires computing $\frac12 L(L-1)$ terms, each involving the Hessian, in order to compute the Jacobian after $L$ leapfrog steps. Consider for the moment a Gaussian target with potential function $U(q) = \frac12 (q - q^*)^\top \Sigma^{-1} (q - q^*)$ for $q^* \in \mathbb{R}^d$ and positive definite $\Sigma \in \mathbb{R}^{d \times d}$. Then, due to (10), for any $v \in \mathbb{R}^d$,
$$DS_L(v) = -h^2 \sum_{i=1}^{L-1} \frac{(L - i)\, i}{L}\, C^\top \Sigma^{-1} C\, (I + DS_i(v)) = D_L + R_L(v), \quad D_L = -h^2 C^\top \Sigma^{-1} C \sum_{i=1}^{L-1} \frac{(L - i)\, i}{L} = -h^2 \frac{L^2 - 1}{6} C^\top \Sigma^{-1} C, \quad (11)$$
and a remainder term $R_L(v) = -h^2 C^\top \Sigma^{-1} C \sum_{i=1}^{L-1} \frac{(L - i)\, i}{L} DS_i(v)$. From Lemma 1, we see that if $\| C^\top \Sigma^{-1} C \|_2 \le \frac{1}{4 h^2 L^2}$, then $I + DS_L(v)$ and $-DS_L(v)$ are positive definite. Then $R_L$ is also positive definite, so $\log \det(I + D_L) \le \log |\det(I + DS_L(v))|$ and we can maximize the lower bound instead. Put differently, for Gaussian targets, $DS_L$ can be decomposed into a component $D_L$ that contains all terms that are linear in $h^2 C^\top \Sigma^{-1} C$ and does not require a recursion, plus a component $R_L$ that contains terms of higher than linear order in $h^2 C^\top \Sigma^{-1} C$ and needs to be computed recursively. Our suggestion is to ignore this second term. Note that $R_2 = 0$, and an extension could include higher-order terms $O\big( (h^2 C^\top \Sigma^{-1} C)^k \big)$, $k > 1$, in the approximation $D_L$.

For an arbitrary potential energy $U$, equation (10) shows that evaluating $DS_L$ leads to a non-linear function of the Hessians evaluated along the different points of the leapfrog trajectory. We suggest replacing it with a first-order term based on a single Hessian evaluation, scaled accordingly. Concretely, we maximize
$$\mathcal{L}(\theta) = \log |\det(I + D_L)| \quad \text{with} \quad D_L = -h^2 \frac{L^2 - 1}{6} C^\top \nabla^2 U(q_{\lfloor L/2 \rfloor})\, C \quad (12)$$
as an approximation of $\log |\det(I + DS_L)|$. The intuition is that we assume the target density can be approximated locally by a Gaussian whose precision matrix $\Sigma^{-1}$ in (11) is given by the Hessian of $U$ at the mid-point $q_{\lfloor L/2 \rfloor}$ of the trajectory.
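The following sketch (our own illustration) contrasts the exact Jacobian correction obtained from the recursion (10) with the first-order approximation (12) for a Gaussian target, where both quantities are independent of $v$; with a small enough step size, the approximation lower-bounds the exact term as described above.

```python
# Exact log|det(I + DS_L)| via the recursion (10) vs. the first-order
# approximation log det(I + D_L) of (12) for a Gaussian target.
import numpy as np

rng = np.random.default_rng(2)
d, h, L = 5, 0.02, 10
Sigma_inv = np.diag(rng.uniform(0.5, 4.0, d))   # target precision (= Hessian of U)
C = np.diag(rng.uniform(0.3, 1.0, d))           # diagonal preconditioner
B = C.T @ Sigma_inv @ C                         # h^2 L^2 ||B|| < 1/4 holds here
I = np.eye(d)

DS = [np.zeros((d, d))]                          # DS_1 = 0
for l in range(2, L + 1):
    DS.append(-h**2 * sum((l - i) * i / l * B @ (I + DS[i - 1])
                          for i in range(1, l)))
exact = np.linalg.slogdet(I + DS[-1])[1]

DL = -h**2 * (L**2 - 1) / 6.0 * B                # approximation (12)
approx = np.linalg.slogdet(I + DL)[1]
print(f"exact {exact:.6f}  >=  first-order lower bound {approx:.6f}")
```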
We want to optimize $\mathcal{L}(\theta)$ given in (12) even if we do not have access to the Hessian $\nabla^2 U$ explicitly, but only through Hessian-vector products $\nabla^2 U(q)\, w$ for some vector $w \in \mathbb{R}^d$. Vector-Jacobian products $\mathrm{vjp}(f, x, w) = w^\top Df(x)$ for differentiable $f : \mathbb{R}^d \to \mathbb{R}^d$ can be computed efficiently via reverse-mode automatic differentiation, so that $\nabla^2 U(q)\, w = \mathrm{vjp}(\nabla U, q, w)^\top$ can be evaluated with complexity linear in $d$. Suppose that multiplication with $D_L$ is a contraction, so that all eigenvalues of $D_L$ have absolute value smaller than one. Then one can apply a Hutchinson stochastic trace estimator of $\log |\det(I + D_L)|$ with a Taylor approximation, truncated and re-weighted using a Russian-roulette estimator [44]; see also [29, 5, 13] for similar approaches in different settings. More concretely, let $N$ be a positive random variable with support on $\mathbb{N}$ and let $p_k = P(N \ge k)$. Then,
$$\mathcal{L}(\theta) = \log \det(I + D_L) = \mathbb{E}_{N, \varepsilon} \left[ \sum_{k=1}^{N} \frac{(-1)^{k+1}}{k\, p_k}\, \varepsilon^\top (D_L)^k \varepsilon \right], \quad (13)$$
where $\varepsilon$ is drawn from a Rademacher distribution. While this yields an unbiased estimator for $\mathcal{L}(\theta)$ and its gradient, as shown in Appendix A.1, provided $D_L$ is contractive, it can be computationally expensive if $N$ has a large mean, or have a high variance if $D_L$ has an eigenvalue close to $1$ or $-1$; see [44, 17]. Since both the first-order Gaussian approximation and the Russian-roulette estimator hinge on $D_L$ having small absolute eigenvalues, we consider a constrained optimisation approach that penalises such large eigenvalues. For the random variable $N$ that determines the truncation level in the Taylor series, we compute $b_N = (D_L)^N \varepsilon / \| (D_L)^N \varepsilon \|_2$ and $\mu_N = b_N^\top D_L b_N$. Note that this corresponds to applying $N$ iterations of the power method: with $|\lambda_1| > |\lambda_2| \ge \ldots \ge |\lambda_d|$ denoting the eigenvalues of the symmetric matrix $D_L$, we have $\mu_n \to \lambda_1$ almost surely as $n \to \infty$, see [26]. For some $\delta \in (0, 1)$, we choose a differentiable, monotone increasing penalty function $h : \mathbb{R} \to \mathbb{R}$ such that $h(x) > 0$ for $x > \delta$ and $h(x) = 0$ for $x \le \delta$, and we add the term $\gamma h(|\mu_N|)$ for $\gamma > 0$ to the loss function introduced below; see Appendix A.2 for an example of $h$.

3.2 Adaptation with a generalised speed measure

Extending the objective from [54] to adapt the HMC proposal, we aim to solve
$$\max_\theta \int \int \pi(q_0)\, \nu(v) \left[ \log a\big( (q_0, v), (T_L(v), W_L(v)) \big) - \beta \log r_L(T_L(v)) \right] \mathrm{d}v\, \mathrm{d}q_0, \quad (14)$$
where $T_L$, $W_L$, $r_L$ as well as the acceptance rate $a$ depend on $q_0$ and the parameters $\theta$ we want to adapt. The hyperparameter $\beta > 0$ can be adapted online by increasing $\beta$ if the acceptance rate is above a target acceptance rate $\alpha^*$ and decreasing $\beta$ otherwise. We choose $\alpha^* = 0.67$, which is optimal for increasing $d$ under independence assumptions [8]. One part of the objective amounts to minimizing the energy error $\Delta(q_0, v) = H(T_L(v), W_L(v)) - H(q_0, C^{-\top} v)$, which determines the log-acceptance rate via $\log a(q_0, C^{-\top} v) = \min\{0, -\Delta(q_0, v)\}$. Unbiased gradients of the energy error can be obtained without stopping any gradient calculations in the backward pass. However, we found that a multi-step extension of the biased fast MALA approximation from [54] tends to improve the adaptation by stopping gradients through $\nabla U$, as shown in Appendix A.3. Suppose that the current state of the Markov chain is $q$. We resample the momentum $v \sim N(0, I)$ and aim to solve (14) by taking gradients of the penalised loss function
$$E(\theta) = -\min\{0, -\Delta(q, v)\} - \beta \left( d \log h + \log |\det C| + \mathcal{L}(\theta) - \gamma h(|\mu_N|) \right),$$
as illustrated in Algorithm 1, which also shows how we update the hyperparameters $\beta$ and $\gamma$. The adaptation scheme in Algorithm 1 requires choosing learning rates $\rho_\theta$, $\rho_\beta$, $\rho_\gamma$ and can be viewed within a stochastic approximation framework of controlled Markov chains, see for instance [2, 1, 3]. Different conditions have been established under which infinite adaptive schemes still converge to the correct invariant distribution, such as diminishing adaptation and containment [50]. We have used Adam [37] with a constant step size to adapt the mass matrix, but have stopped the adaptation after a fixed number of steps so that convergence is preserved; we leave an investigation of the convergence properties of an infinite adaptive scheme for future work.
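As an illustration of the estimator (13), the following standalone sketch (ours, not the paper's implementation) uses a Rademacher probe and a geometric truncation level, for which $p_k = P(N \ge k) = (1 - q)^{k-1}$; the matrix $D$ is a synthetic symmetric contraction rather than an actual $D_L$.

```python
# Truncated, re-weighted Hutchinson / Russian-roulette estimator (13) of
# log det(I + D) for a symmetric contraction D (eigenvalues in (-1, 1)).
import numpy as np

rng = np.random.default_rng(3)
d = 6
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
D = Q @ np.diag(rng.uniform(-0.4, 0.4, d)) @ Q.T     # symmetric, spectral radius < 1

def roulette_logdet(D, rng, q=0.4):
    eps = rng.choice([-1.0, 1.0], size=D.shape[0])   # Rademacher probe
    N = rng.geometric(q)                             # truncation level on {1, 2, ...}
    est, eta = 0.0, eps
    for k in range(1, N + 1):
        eta = D @ eta                                # eta = D^k eps
        p_k = (1.0 - q) ** (k - 1)                   # P(N >= k)
        est += (-1.0) ** (k + 1) / (k * p_k) * (eps @ eta)
    return est

draws = [roulette_logdet(D, rng) for _ in range(20000)]
print(np.mean(draws), "vs exact", np.linalg.slogdet(np.eye(d) + D)[1])
```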
4 Numerical experiments

This section illustrates the mixing performance of the entropy-based sampler for a variety of target densities. First, we consider Gaussian targets, either in high dimensions or with a high condition number. Our results confirm (i) that HMC scales better than MALA for high-dimensional Gaussian targets and (ii) that the adaptation scheme learns a mass matrix that is adjusted to the geometry of the target. This is in contrast to adaptation schemes trying to optimize the ESJD [49] or variants thereof [40], which can lead to good mixing in a few components only. Next, we apply the novel adaptation scheme to Bayesian logistic regression models and find that it often outperforms NUTS, except on a few data sets where some components might mix less efficiently. We also compare the entropy-based adaptation with Riemann-manifold based samplers for a log-Gaussian Cox point process model. We find that both schemes mix similarly, which indicates that the gradient-based adaptation scheme can learn a suitable mass matrix without having access to the expected Fisher information matrix. Then, we consider a high-dimensional stochastic volatility model where the entropy-based scheme performs favourably compared to alternatives, and we illustrate that efficient sparsity assumptions can be accommodated when learning the mass matrix. Finally, we show in a toy example how the suggested approach might be modified to sample from highly non-convex potentials. Our implementation (https://github.com/marcelah/entropy_adaptive_hmc) builds on tensorflow probability [39], with some target densities taken from [53]. We used 10 parallel chains throughout our experiments to adapt the mass matrix.

Algorithm 1 Sample the next state $q$ and adapt $\beta$, $\gamma$ and $\theta$.
1: Sample velocity $v \sim N(0, I)$ and set $p = C^{-\top} v$.
2: Apply the integrator LF to obtain $(q_\ell, p_\ell, \nabla U(q_\ell))_{0 \le \ell \le L} = \mathrm{LF}(q, p)$.
3: Stop gradients $\nabla U(q_\ell) = \mathrm{stop\_grad}(\nabla U(q_\ell))$ for $0 \le \ell \le L$.
4: Compute $\Xi_L(v)$ using (5).
5: Compute $\Delta(q_0, v)$ using (16) and set $a = \min\{1, e^{-\Delta(q_0, v)}\}$.
6: Compute $\eta_N, y = \mathrm{RADEMACHER}()$.
7: Set $\mathcal{L}(\theta) = \mathrm{stop\_grad}(y)^\top D_L \varepsilon$.
8: Set $b_N = \mathrm{stop\_grad}(\eta_N / \|\eta_N\|_2)$ and $\mu_N = b_N^\top D_L b_N$.
9: Set $E(\theta) = -\min\{0, -\Delta(q_0, v)\} - \beta \left( d \log h + \log |\det C| + \mathcal{L}(\theta) - \gamma h(|\mu_N|) \right)$.
10: Adapt $\theta \leftarrow \theta - \rho_\theta \nabla_\theta E(\theta)$.
11: Adapt $\beta \leftarrow \Pi_\beta[\beta(1 + \rho_\beta(a - \alpha^*))]$. # $\Pi_\beta$ projects onto a compact set; default value $[10^{-2}, 10^2]$.
12: Adapt $\gamma \leftarrow \Pi_\gamma[\gamma + \rho_\gamma h(|\mu_N|)]$. # $\Pi_\gamma$ projects onto a compact set; default value $[10^3, 10^5]$.
13: Sample $u \sim U(0, 1)$ and set $q \leftarrow \mathbb{1}\{u \le a\}\, q_L + \mathbb{1}\{u > a\}\, q$.
14: function DL($w$):
15:   # DL($w$) $= D_L w$ computes Hessian-vector products efficiently
16:   $z = \mathrm{vjp}(\nabla U, \mathrm{stop\_grad}(q_{\lfloor L/2 \rfloor}), Cw)$
17:   return $-h^2 \frac{L^2 - 1}{6} C^\top z$
18: end function
19: function RADEMACHER():
20:   Sample a Rademacher random vector $\varepsilon$ and a truncation level $N$.
21:   Initialise $y \leftarrow 0$ and $\eta_0 = \varepsilon$.
22:   for $k = 1 \ldots N$ do
23:     # Apply a spectral normalisation for stability if $D_L$ is not a contraction; $\delta \in (0, 1)$.
24:     Set $\eta_k = \mathrm{DL}(\eta_{k-1}) \min\{1, \delta \|\eta_{k-1}\|_2 / \|\mathrm{DL}(\eta_{k-1})\|_2\}$ and $y \leftarrow y + \frac{(-1)^k}{p_k} \eta_k$.
25:   end for
26:   return $\eta_N, y$
27: end function

4.1 Gaussian targets

Anisotropic Gaussian distributions. We consider sampling from a multivariate Gaussian distribution $N(0, \Sigma)$ with strictly convex potential $U(q) = \frac12 q^\top \Sigma^{-1} q$ for different covariance matrices $\Sigma$. For $c > 0$, assume a covariance matrix given by $\Sigma_{ij} = \delta_{ij} \exp\left( c \frac{i-1}{d-1} \log 10 \right)$. We set (i) $c = 3$ and $d \in \{10^3, 10^4\}$ and (ii) $c = 6$ and $d = 100$, as considered in [52]. The eigenvalues of the covariance matrix are thus spread between 1 and $10^3$ in setting (i), while they vary between 1 and $10^6$ in setting (ii). The preconditioning factor $C$ is assumed to be diagonal. We adapt the sampler for $4 \times 10^4$ steps in case (i) and for $10^5$ steps in case (ii).
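For reference, a short sketch (ours) constructing this diagonal covariance and confirming that its eigenvalues span $[1, 10^c]$:

```python
# Anisotropic diagonal covariance of Section 4.1: Sigma_ii = exp(c (i-1)/(d-1) log 10).
import numpy as np

def anisotropic_cov(d, c):
    i = np.arange(d)
    return np.diag(np.exp(c * i / (d - 1) * np.log(10.0)))

for c, d in [(3, 1000), (6, 100)]:
    lam = np.diag(anisotropic_cov(d, c))
    print(f"c={c}, d={d}: eigenvalues span [{lam.min():.0f}, {lam.max():.0f}]")
# c=3 gives a condition number of 10^3, c=6 of 10^6.
```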
We compare it with the NUTS implementation in TensorFlow Probability (TFP) [39], with a default maximum tree depth of 10 and step sizes adapted using dual averaging [32, 47], denoted by N in the figures below. Additionally, we consider a further adaptation of NUTS that adapts a diagonal mass matrix using an online variance estimate of the accepted samples, as implemented in TFP and denoted AN subsequently. We also consider two objectives as replacements of the generalised speed measure (GSM): (a) the ESJD and (b) a weighted combination of the ESJD and its inverse as suggested in Levy et al. [40], without any burn-in component, which we denote L2HMC; see Appendix D for a precise definition.

We compute the minimum and mean effective sample size (min ESS and mean ESS) over all coordinate functions $q \mapsto q_i$, $i \in \{1, \ldots, d\}$, as shown in Figures 1a-1b for $d = 10^3$ in case (i), with leapfrog steps ranging from $L = 1$ to $10$. HMC adapted with the GSM objective performs well in terms of min ESS/sec for $L > 1$, whereas the ESJD and L2HMC objectives yield poor mixing as measured by min ESS/sec. The mean ESS/sec statistics are more similar across the different objectives. These observations provide some empirical evidence that the ESJD can be high even when some components mix poorly, which has been a major motivation for the GSM objective in [54]. The mass matrix learned using the GSM adapts to the target covariance, as can be seen from the condition numbers of $C^\top \Sigma^{-1} C$ in Figure 1c becoming relatively close to 1. The GSM objective also yields acceptance rates approaching 1 for increasing leapfrog steps, and multiplication with $D_L$ becomes a contraction, as shown in Appendix F.1, Figure 7. Results for $d = 10^4$ can be found in Figure 8 in Appendix F.1 and indicate that as the dimension increases, using more leapfrog steps becomes more advantageous. For case (ii) of a very ill-conditioned target, results in Table 1 show that the GSM objective leads to better min ESS/sec values, while further statistics in Figure 9 illustrate that the GSM also yields higher min ESS/sec values compared to NUTS with an adapted mass matrix. We want to emphasize that for fixed $L$, high acceptance rates for HMC need not be disadvantageous. This is illustrated in Figure 11 in Appendix F.4 for a Gaussian target $N(0, I)$ in dimension $d = 10$, where tuning just the step size to achieve a target acceptance rate can lead to slow mixing for some $L$, because the proposal can make a U-turn.

Figure 1: Minimum (1a) and mean (1b) effective sample size of $q \mapsto q_i$ per second after adaptation for an anisotropic Gaussian target ($d = 1000$). The condition numbers of the transformed Hessian $C^\top \Sigma^{-1} C$ are shown in (1c).

Correlated Gaussian distribution. We sample from a 51-dimensional Gaussian target whose covariance matrix is given by the squared exponential kernel plus small white noise, as in [54], with $k(x_i, x_j) = \exp\left( -\frac{1}{2} (x_i - x_j)^2 / 0.4^2 \right) + 0.01\, \delta_{ij}$ on a regular grid on $[0, 4]$. We consider a general Cholesky factor $C$. The adaptation is performed over $10^5$ steps. Results over 10 runs are shown in Figure 10 in Appendix F.3 and summarized in Table 2.
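A short sketch (ours) of this covariance construction:

```python
# 51-dimensional correlated Gaussian target: squared-exponential kernel plus
# white-noise jitter on a regular grid on [0, 4].
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0.0, 4.0, 51)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.4**2) + 0.01 * np.eye(51)
print("condition number:", np.linalg.cond(K))   # ill-conditioned despite the jitter
sample = np.linalg.cholesky(K) @ rng.standard_normal(51)   # one exact draw from N(0, K)
```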
Table 1: Min ESS/sec for gradient-based adaptation schemes targeting an ill-conditioned Gaussian density (d = 100).

Steps   GSM            ESJD          L2HMC
1       122.3 (15.5)   0.1 (0.01)    0.1 (0.01)
5       753.8 (22.2)   0.1 (0.02)    0.1 (0.02)
10      570.0 (37.4)   0.6 (395.2)   0.1 (0.05)

Table 2: Min ESS/sec for gradient-based adaptation schemes targeting a correlated Gaussian density (d = 51).

Steps   GSM           ESJD        L2HMC
1       63.8 (3.9)    0.8 (1.6)   0.3 (0.1)
5       390.0 (5.0)   2.0 (5.4)   2.7 (2.3)
10      282.7 (7.8)   0.9 (3.7)   0.4 (0.9)

4.2 Logistic regression

Consider a Bayesian logistic regression model with $n$ data points $y_i \in \{0, 1\}$ and $d$-dimensional covariates $x_i \in \mathbb{R}^d$ for $i \in \{1, \ldots, n\}$. Assuming a Gaussian prior with covariance matrix $\Sigma_0$ implies a potential function
$$U(q) = \sum_{i=1}^{n} \left[ -y_i x_i^\top q + \log\left( 1 + e^{x_i^\top q} \right) \right] + \frac12 q^\top \Sigma_0^{-1} q.$$
We considered six datasets (Australian Credit, Heart, Pima Indian, Ripley, German Credit and Caravan) that are commonly used for benchmarking inference methods, cf. [16]. The state dimension ranges from $d = 3$ to $d = 87$. We choose $\Sigma_0 = I$ and parameterize $C$ via a Cholesky matrix. We adapt over $10^4$ steps. HMC with a moderate number of leapfrog steps tends to perform better on four out of six data sets, with subpar performance for the Australian and Caravan data in terms of min ESS/sec, albeit with higher mean ESS/sec across dimensions. The adaptive HMC algorithm tends to perform well if $D_L$ is contractive during iterations of the Markov chain, as for the German Credit data set shown in Figure 2, where the eigenvalues of $D_L$ are estimated using a power iteration. If this is not the case, as for the Caravan data in Figure 3, the adapted HMC algorithm can perform worse than MALA or NUTS. More detailed diagnostics for all data sets can be found in Appendix G.

Figure 2: Minimum (2a) and mean (2b) effective sample size for a Bayesian logistic regression model on the German Credit data set (d = 25) after adaptation. Estimates of the eigenvalues of $D_L$ in (2c).

Figure 3: Minimum (3a) and mean (3b) effective sample size for a Bayesian logistic regression model on the Caravan data set (d = 87) after adaptation. Estimates of the eigenvalues of $D_L$ in (3c).

4.3 Log-Gaussian Cox point process

Inference in a log-Gaussian Cox process model is an ideal setting for Riemann-manifold (RM) MALA and HMC [25], since a constant, position-independent metric tensor is used therein, making the complexity no longer cubic but only quadratic in the dimension $d$ of the target. Consider an area on $[0, 1]^2$ discretized into grid locations $(i, j)$, for $i, j = 1, \ldots, n$. The observations $y_{ij}$ are Poisson distributed and conditionally independent given a latent intensity process $\{\lambda_{ij}\}$, with means $\lambda_{ij} = m \exp(x_{ij})$ for $m = n^{-2}$ and a latent vector $x$ drawn from a Gaussian process with constant mean $\mu$ and covariance function $\Sigma_{(i,j),(i',j')} = \sigma_x^2 \exp\left\{ -\sqrt{(i - i')^2 + (j - j')^2} / (n\beta) \right\}$. The target is proportional to
$$p(y, x) \propto \prod_{i,j=1}^{n} \exp\left[ y_{ij} x_{ij} - m \exp(x_{ij}) \right] \cdot \exp\left( -(x - \mu 1)^\top \Sigma^{-1} (x - \mu 1)/2 \right).$$
For the RM-based samplers, the preconditioning matrix is $M = \Lambda + \Sigma^{-1}$, where $\Lambda$ is a diagonal matrix with diagonal elements $\{m \exp(\mu + \Sigma_{ii})\}_i$, and step sizes are adapted using dual averaging. We generate simulated data for $d \in \{64, 256\}$ and adapt for 2000 steps using a Cholesky factor $C$. Figure 18 in Appendix H illustrates that the entropy-based adaptation can achieve a higher min ESS/sec score for $d = 64$, with high acceptance rates for increasing leapfrog steps. The RM samplers perform slightly better in terms of min ESS/sec for $d = 256$; see Figure 4, and Figure 19 for a comparison of the inverse mass matrices.
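A minimal sketch (ours; the hyperparameter values $\sigma_x^2$, $\beta$ and $\mu$ below are illustrative assumptions, not the values used in the experiments) of the Cox-process covariance and the constant RM preconditioner $M = \Lambda + \Sigma^{-1}$:

```python
# Log-Gaussian Cox process on an n x n grid: exponential-decay covariance and
# the constant Riemann-manifold metric M = Lambda + Sigma^{-1}, with
# Lambda_ii = m exp(mu + Sigma_ii).
import numpy as np

n, sigma2_x, beta, mu = 8, 1.0, 0.1, 0.0        # d = n^2 = 64; illustrative values
m = 1.0 / n**2
ij = np.array([(i, j) for i in range(n) for j in range(n)], dtype=float)
dist = np.linalg.norm(ij[:, None, :] - ij[None, :, :], axis=-1)
Sigma = sigma2_x * np.exp(-dist / (n * beta))

Lam = np.diag(m * np.exp(mu + np.diag(Sigma)))
M = Lam + np.linalg.inv(Sigma)                  # position-independent preconditioner
print(M.shape, np.linalg.cond(M))
```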
Figure 4: Minimum (4a) and mean (4b) effective sample size for a Cox process in dimension d = 256 after adaptation. Estimates of the eigenvalues of $D_L$ using power iteration in (4c).

4.4 Stochastic volatility model

We consider a stochastic volatility model [36, 34] that has been used, with minor variations, for adapting HMC [25, 32, 57]. Assume that the latent log-volatilities follow an autoregressive AR(1) process, so that $h_1 \sim N(0, \sigma^2/(1 - \phi^2))$ and, for $t \in \{1, \ldots, T-1\}$, $h_{t+1} = \phi h_t + \eta_{t+1}$ with $\eta_t \sim N(0, \sigma^2)$. The observations follow the dynamics $y_t \mid h_t \sim N(0, \exp(\mu + h_t))$. The prior distributions for the static parameters are: the persistence of the log-volatility process $(\phi + 1)/2 \sim \mathrm{Beta}(20, 1.5)$; the mean log-volatility $\mu \sim \mathrm{Cauchy}(0, 2)$; and the scale of the white-noise process $\sigma \sim \mathrm{Half\text{-}Cauchy}(0, 1)$. We reparametrize $\phi$ and $\sigma$ with a sigmoid- and a softplus-transformation, respectively. Observe that the precision matrix of the AR(1) process is tridiagonal. Since a Cholesky factor of such a matrix is tridiagonal, we consider the parameterization $C = B_\theta^{-1}$ for an upper-triangular and tridiagonal matrix $B_\theta$. The required operations with such banded matrices have a complexity of $O(d)$; see for instance [22]. For comparison, we also consider a diagonal matrix $C$. We apply the model to ten years of daily returns of the S&P 500 index, giving rise to a target dimension of $d = 2519$. To account for the different number of gradient evaluations, we use $3.5 \times 10^4 / L$ steps for the adaptation and for evaluating the sampler based on $L \in \{1, \ldots, 10\}$ leapfrog steps. We run NUTS for 1000 steps, which has a four times higher run-time compared to the other samplers. In addition to using the effective sample size to assess convergence, we also report the potential scale reduction factor split-$\hat{R}$ [24, 55], where large values are indicative of poor mixing. We report results over three replications in Figure 5, with more details in Figure 20, Appendix I. First, HMC with moderately large $L$ tends to improve the effective samples per computation time compared to the MALA case, while also having a smaller $\hat{R}$. Second, using a tridiagonal mass matrix improves mixing compared to a diagonal one, particularly for the latent log-volatilities, as seen in the median ESS/sec or median $\hat{R}$ values. The largest absolute eigenvalue of $D_L$ tends to be smaller for a tridiagonal mass matrix, and the acceptance rates approach 100% more slowly for increasing $L$. Third, NUTS seems less efficient, as does using a dual-averaging adaptation scheme. We imagine that similarly efficient parameterizations of $M$ or $M^{-1}$ can be used for different generalisations of the above stochastic volatility model, such as including $p$ sub-diagonals for log-volatilities with higher-order AR($p$) dynamics, or multivariate extensions using a suitable block structure. Likewise, this approach might also be useful for inference in other Gaussian Markov random field models with sparse precision matrices.

Figure 5: Minimum (5a) and median (5b) effective sample size per second and maximum $\hat{R}$ of $q \mapsto q_i$ for a stochastic volatility model (d = 2519) after adaptation.
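A short sketch (ours) verifying that the AR(1) precision matrix is tridiagonal and that its Cholesky factor stays within the band, which is what makes the banded parameterization $C = B_\theta^{-1}$ efficient:

```python
# Stationary AR(1) log-volatility path: tridiagonal precision and banded Cholesky.
import numpy as np

T, phi, sigma = 8, 0.95, 0.3
Q = np.zeros((T, T))
np.fill_diagonal(Q, 1.0 + phi**2)
Q[0, 0] = Q[-1, -1] = 1.0
idx = np.arange(T - 1)
Q[idx, idx + 1] = -phi
Q[idx + 1, idx] = -phi
Q /= sigma**2                        # tridiagonal precision of the AR(1) path

B = np.linalg.cholesky(Q)            # lower-triangular Cholesky factor
print(np.allclose(np.tril(B, -2), 0))   # True: the factor is (bi)diagonally banded
```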
4.5 Learning non-linear transformations

To illustrate an extension for sampling from highly non-convex targets by learning a non-linear transformation within the suggested framework, explained in greater detail in Appendix C, we consider sampling from a two-dimensional Banana distribution. It results from transforming $N(0, \Lambda)$, where $\Lambda$ is a diagonal matrix with entries 100 and 1, via the volume-preserving map $\varphi_b(x) = (x_1, x_2 + b(x_1^2 - 100))$ for $b = 0.1$, cf. [27]. We consider a RealNVP-type [19] transformation $f = f_3 \circ f_2 \circ f_1$, where $f_1(x_1, x_2) = (x_1, x_2 \cdot g(s(x_1)) + t(x_1))$, $f_2(x_1, x_2) = (x_1 \cdot g(s(x_2)) + t(x_2), x_2)$ and $f_3(x_1, x_2) = (c_1 x_1, c_2 x_2)$. The functions $s$ and $t$ are neural networks with two hidden layers of size 50. For numerical stability, we found it beneficial to use a modified affine scaling function $g$, namely a sigmoid function scaled to a restricted range such as $(0.5, 2)$, as also suggested in [6]. As an alternative, we also consider learning a linear transformation $f(x) = Cx$ for a Cholesky matrix $C$, as well as NUTS and a standard HMC sampler with step size adapted to achieve a target acceptance rate of 0.65. Figure 6 summarizes the ESS, where each method uses $4 \times 10^5$ samples before and after the adaptation. Whereas a linear transformation does not improve on standard HMC, non-linear transformations can improve the mixing efficiency.

Figure 6: Minimum (6a) and mean (6b) effective sample size per second as well as minimum effective sample size (6c) for a Banana-shaped target in dimension d = 2 after adaptation.
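A minimal sketch (ours) of the Banana target: exact samples can be obtained by pushing Gaussian draws through the volume-preserving map $\varphi_b$, which is useful as ground truth when judging sampler output.

```python
# Banana target of Section 4.5: push N(0, diag(100, 1)) through phi_b.
import numpy as np

rng = np.random.default_rng(4)
b = 0.1
x = rng.multivariate_normal([0.0, 0.0], np.diag([100.0, 1.0]), size=5000)
banana = np.column_stack([x[:, 0], x[:, 1] + b * (x[:, 0] ** 2 - 100.0)])
print(banana.mean(axis=0), banana.std(axis=0))   # exact reference samples
```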
5 Discussion and Outlook

Limitations. Our approach of learning a constant mass matrix can struggle for targets whose Hessian varies greatly across the state space, which can yield relatively short integration times with very high acceptance rates. While this effect might be mitigated by considering non-linear transformations, it remains challenging to learn flexible transformations efficiently in high dimensions.

Variations of the entropy objective. Recent work [18, 11] has suggested adding the cross-entropy term $\int \pi(q) \int r(q'|q) \log \pi(q')\, \mathrm{d}q'\, \mathrm{d}q$ to the entropy objective for optimizing the parameters of a Metropolis-Hastings kernel with proposal density $r(q'|q)$. Algorithm 1 can be adjusted to such variations, possibly by stopping gradients through $\nabla U$ as for optimizing the energy error term.

Variations of HMC. We have considered a standard HMC setting with a fixed number of leapfrog steps. One could consider a mixture of HMC kernels with different numbers of leapfrog steps, and an interesting question would be how to learn the different mass matrices jointly in an efficient way. Instead of a full velocity refreshment, partial refreshment strategies [33] can sometimes mix better. The suggested adaptation approach can yield very high acceptance rates, particularly for increasing leapfrog steps, and the learned mass matrix could be used with a partial refreshment; however, it would be interesting to analyse whether the adaptation can be adjusted to such persistent velocity updates. It would also be of interest to analyse whether similar ideas can be used to adapt different numerical integrators, such as those suggested in [7] for target densities relative to a Gaussian measure, or for multinomial HMC with an additional intra-trajectory sampling step [9, 59]. Our focus was on learning a mass matrix so that samples from the Markov chain can be used for estimators that are consistent for increasing iterations. However, unbiased estimators might also be constructed using coupled HMC chains [30], and one might ask whether the adapted mass matrix leads to shorter meeting times in such a setting.

Acknowledgements

The authors acknowledge the use of the UCL Myriad High Performance Computing Facility (Myriad@UCL), and associated support services, in the completion of this work.

Funding Transparency Statement

There are no additional sources of funding to disclose.

References

[1] Christophe Andrieu and Eric Moulines. On the ergodicity properties of some adaptive MCMC algorithms. The Annals of Applied Probability, 16(3):1462-1505, 2006.
[2] Christophe Andrieu and Johannes Thoms. A tutorial on adaptive MCMC. Statistics and Computing, 18(4):343-373, 2008.
[3] Christophe Andrieu, Vladislav B. Tadic, and Matti Vihola. On the stability of some controlled Markov chains and its applications to stochastic approximation with Markovian dynamic. The Annals of Applied Probability, 25(1):1-45, 2015.
[4] Christophe Andrieu, Anthony Lee, and Sam Livingstone. A general perspective on the Metropolis-Hastings kernel. arXiv preprint arXiv:2012.14881, 2020.
[5] Jens Behrmann, Will Grathwohl, Ricky T. Q. Chen, David Duvenaud, and Jörn-Henrik Jacobsen. Invertible residual networks. In International Conference on Machine Learning, pages 573-582, 2019.
[6] Jens Behrmann, Paul Vicol, Kuan-Chieh Wang, Roger Grosse, and Jörn-Henrik Jacobsen. Understanding and mitigating exploding inverses in invertible neural networks. In International Conference on Artificial Intelligence and Statistics, pages 1792-1800. PMLR, 2021.
[7] Alexandros Beskos, Frank J. Pinski, Jesús María Sanz-Serna, and Andrew M. Stuart. Hybrid Monte Carlo on Hilbert spaces. Stochastic Processes and their Applications, 121(10):2201-2230, 2011.
[8] Alexandros Beskos, Natesh Pillai, Gareth Roberts, Jesús María Sanz-Serna, and Andrew Stuart. Optimal tuning of the hybrid Monte Carlo algorithm. Bernoulli, 19(5A):1501-1534, 2013.
[9] Michael Betancourt. A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint arXiv:1701.02434, 2017.
[10] Nawaf Bou-Rabee and Jesús María Sanz-Serna. Geometric integrators and the Hamiltonian Monte Carlo method. Acta Numerica, 27:113-206, 2018.
[11] Chris Cannella and Vahid Tarokh. Semi-empirical objective functions for MCMC proposal optimization. arXiv preprint arXiv:2106.02104, 2021.
[12] Bob Carpenter, Andrew Gelman, Matthew D. Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, 76(1):1-32, 2017.
[13] Tian Qi Chen, Jens Behrmann, David K. Duvenaud, and Jörn-Henrik Jacobsen. Residual flows for invertible generative modeling. In Advances in Neural Information Processing Systems, pages 9913-9923, 2019.
[14] Yuansi Chen, Raaz Dwivedi, Martin J. Wainwright, and Bin Yu. Fast mixing of Metropolized Hamiltonian Monte Carlo: Benefits of multi-step gradients. arXiv preprint arXiv:1905.12247, 2019.
[15] Zongchen Chen and Santosh S. Vempala. Optimal convergence rate of Hamiltonian Monte Carlo for strongly logconcave distributions. arXiv preprint arXiv:1905.02313, 2019.
[16] Nicolas Chopin, James Ridgway, et al. Leave Pima Indians alone: binary regression as a benchmark for Bayesian computation. Statistical Science, 32(1):64-87, 2017.
[17] Rob Cornish, Anthony L. Caterini, George Deligiannidis, and Arnaud Doucet. Relaxing bijectivity constraints with continuously indexed normalising flows. arXiv preprint arXiv:1909.13833, 2019.
[18] Ameer Dharamshi, Vivian Ngo, and Jeffrey S. Rosenthal. Sampling by divergence minimization. arXiv preprint arXiv:2105.00520, 2021.
[19] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
[20] Simon Duane, Anthony D. Kennedy, Brian J. Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216-222, 1987.
[21] Alain Durmus, Eric Moulines, and Eero Saksman. On the convergence of Hamiltonian Monte Carlo. arXiv preprint arXiv:1705.00166, 2017.
[22] Nicolas Durrande, Vincent Adam, Lucas Bordeaux, Stefanos Eleftheriadis, and James Hensman. Banded matrix operators for Gaussian Markov models in the automatic differentiation era. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2780-2789. PMLR, 2019.
[23] Hong Ge, Kai Xu, and Zoubin Ghahramani. Turing: A language for flexible probabilistic inference. In International Conference on Artificial Intelligence and Statistics, pages 1682-1690. PMLR, 2018.
[24] Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. Bayesian Data Analysis. CRC Press, 2013.
[25] Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123-214, 2011.
[26] Gene H. Golub and Charles F. Van Loan. Matrix Computations, volume 3. JHU Press, 2012.
[27] Heikki Haario, Eero Saksman, and Johanna Tamminen. Adaptive proposal distribution for random walk Metropolis algorithm. Computational Statistics, 14(3):375-395, 1999.
[28] Ernst Hairer, Christian Lubich, and Gerhard Wanner. Geometric numerical integration illustrated by the Störmer-Verlet method. Acta Numerica, 12:399-450, 2003.
[29] Insu Han, Haim Avron, and Jinwoo Shin. Stochastic Chebyshev gradient descent for spectral optimization. In Advances in Neural Information Processing Systems, pages 7386-7396, 2018.
[30] Jeremy Heng and Pierre E. Jacob. Unbiased Hamiltonian Monte Carlo with couplings. Biometrika, 106(2):287-302, 2019.
[31] Matthew Hoffman, Pavel Sountsov, Joshua V. Dillon, Ian Langmore, Dustin Tran, and Srinivas Vasudevan. NeuTra-lizing bad geometry in Hamiltonian Monte Carlo using neural transport. arXiv preprint arXiv:1903.03704, 2019.
[32] Matthew D. Hoffman and Andrew Gelman. The No-U-Turn Sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593-1623, 2014.
[33] Alan M. Horowitz. A generalized guided Monte Carlo algorithm. Physics Letters B, 268(2):247-252, 1991.
[34] Eric Jacquier, Nicholas G. Polson, and Peter E. Rossi. Bayesian analysis of stochastic volatility models. Journal of Business & Economic Statistics, 20(1):69-87, 2002.
[35] Leif T. Johnson and Charles J. Geyer. Variable transformation to obtain geometric ergodicity in the random-walk Metropolis algorithm. The Annals of Statistics, 40(6):3050-3076, 2012.
[36] Sangjoon Kim, Neil Shephard, and Siddhartha Chib. Stochastic volatility: likelihood inference and comparison with ARCH models. The Review of Economic Studies, 65(3):361-393, 1998.
[37] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[38] Ian Langmore, Michael Dikovsky, Scott Geraedts, Peter Norgaard, and Rob Von Behren. A condition number for Hamiltonian Monte Carlo. arXiv preprint arXiv:1905.09813, 2019.
[39] Junpeng Lao, Christopher Suter, Ian Langmore, Cyril Chimisov, Ashish Saxena, Pavel Sountsov, Dave Moore, Rif A. Saurous, Matthew D. Hoffman, and Joshua V. Dillon. tfp.mcmc: Modern Markov chain Monte Carlo tools built for modern hardware. arXiv preprint arXiv:2002.01184, 2020.
[40] Daniel Levy, Matt D. Hoffman, and Jascha Sohl-Dickstein. Generalizing Hamiltonian Monte Carlo with neural networks. In International Conference on Learning Representations, 2018.
[41] Zengyi Li, Yubei Chen, and Friedrich T. Sommer. A neural network MCMC sampler that maximizes proposal entropy. arXiv preprint arXiv:2010.03587, 2020.
[42] Samuel Livingstone, Michael Betancourt, Simon Byrne, and Mark Girolami. On the geometric ergodicity of Hamiltonian Monte Carlo. Bernoulli, 25(4A):3109-3138, 2019.
[43] Samuel Livingstone, Michael F. Faulkner, and Gareth O. Roberts. Kinetic energy choice in Hamiltonian/hybrid Monte Carlo. Biometrika, 106(2):303-319, 2019.
[44] Anne-Marie Lyne, Mark Girolami, Yves Atchadé, Heiko Strathmann, Daniel Simpson, et al. On Russian roulette estimates for Bayesian inference with doubly-intractable likelihoods. Statistical Science, 30(4):443-467, 2015.
[45] Oren Mangoubi and Aaron Smith. Rapid mixing of Hamiltonian Monte Carlo on strongly log-concave distributions. arXiv preprint arXiv:1708.07114, 2017.
[46] Radford M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.
[47] Yurii Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221-259, 2009.
[48] Joonha Park and Yves Atchadé. Markov chain Monte Carlo algorithms with sequential proposals. Statistics and Computing, 30(5):1325-1345, 2020.
[49] Cristian Pasarica and Andrew Gelman. Adaptively scaling the Metropolis algorithm using expected squared jumped distance. Statistica Sinica, pages 343-364, 2010.
[50] Gareth O. Roberts and Jeffrey S. Rosenthal. Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms. Journal of Applied Probability, 44(2):458-475, 2007.
[51] John Salvatier, Thomas V. Wiecki, and Christopher Fonnesbeck. Probabilistic programming in Python using PyMC3. PeerJ Computer Science, 2:e55, 2016.
[52] Jascha Sohl-Dickstein, Mayur Mudigonda, and Michael DeWeese. Hamiltonian Monte Carlo without detailed balance. In International Conference on Machine Learning, pages 719-726, 2014.
[53] Pavel Sountsov, Alexey Radul, and contributors. Inference gym, 2020. URL https://pypi.org/project/inference_gym.
[54] Michalis Titsias and Petros Dellaportas. Gradient-based adaptive Markov chain Monte Carlo. In Advances in Neural Information Processing Systems, pages 15704-15713, 2019.
[55] Aki Vehtari, Andrew Gelman, Daniel Simpson, Bob Carpenter, and Paul-Christian Bürkner. Rank-normalization, folding, and localization: an improved R-hat for assessing convergence of MCMC. Bayesian Analysis, 1(1):1-28, 2021.
[56] Ziyu Wang, Shakir Mohamed, and Nando de Freitas. Adaptive Hamiltonian and Riemann manifold Monte Carlo. In International Conference on Machine Learning, pages 1462-1470. PMLR, 2013.
[57] Changye Wu, Julien Stoehr, and Christian P. Robert. Faster Hamiltonian Monte Carlo by learning leapfrog scale. arXiv preprint arXiv:1810.04449, 2018.
[58] Kai Xu, Hong Ge, Will Tebbutt, Mohamed Tarek, Martin Trapp, and Zoubin Ghahramani. AdvancedHMC.jl: A robust, modular and efficient implementation of advanced HMC algorithms. In Symposium on Advances in Approximate Bayesian Inference, pages 1-10. PMLR, 2020.
[59] Kai Xu, Tor Erlend Fjelde, Charles Sutton, and Hong Ge. Couplings for multinomial Hamiltonian Monte Carlo. In International Conference on Artificial Intelligence and Statistics, pages 3646-3654. PMLR, 2021.
[60] Tengchao Yu, Hongqiao Wang, and Jinglai Li. Maximum conditional entropy Hamiltonian Monte Carlo sampler. SIAM Journal on Scientific Computing, 2021.