# Gradient-based Adaptive Markov Chain Monte Carlo

Michalis K. Titsias, DeepMind, London, UK (mtitsias@google.com)

Petros Dellaportas, Department of Statistical Science, University College London, UK; Department of Statistics, Athens University of Economics and Business, Greece; and The Alan Turing Institute, UK

Abstract

We introduce a gradient-based learning method to automatically adapt Markov chain Monte Carlo (MCMC) proposal distributions to intractable targets. We define a maximum entropy regularised objective function, referred to as generalised speed measure, which can be robustly optimised over the parameters of the proposal distribution by applying stochastic gradient optimisation. An advantage of our method compared to traditional adaptive MCMC methods is that the adaptation occurs even when candidate state values are rejected. This is a highly desirable property of any adaptation strategy because the adaptation starts in early iterations even if the initial proposal distribution is far from optimum. We apply the framework to learning multivariate random walk Metropolis and Metropolis-adjusted Langevin proposals with full covariance matrices, and provide empirical evidence that our method can outperform other MCMC algorithms, including Hamiltonian Monte Carlo schemes.

1 Introduction

Markov chain Monte Carlo (MCMC) is a family of algorithms that provide a mechanism for generating dependent draws from arbitrarily complex distributions. The basic set-up of an MCMC algorithm in any probabilistic (e.g. Bayesian) inference problem, with an intractable target density π(x), is as follows. A discrete-time Markov chain $\{X_t\}_{t \geq 0}$ with transition kernel $P_\theta$, appropriately chosen from a collection of π-invariant kernels $\{P_\theta(\cdot, \cdot)\}_{\theta \in \Theta}$, is generated, and the ergodic averages $\mu_t(F) = t^{-1} \sum_{i=0}^{t-1} F(X_i)$ are used as approximations to $E_\pi(F)$ for any real-valued function F. Although in principle this sampling setup is simple, the actual implementation of any MCMC algorithm requires careful choice of $P_\theta$ because the properties of $\mu_t$ depend on θ. In the common kernels that lead to random walk Metropolis (RWM), Metropolis-adjusted Langevin (MALA) or Hamiltonian Monte Carlo (HMC) algorithms, the kernels $P_\theta$ are specified through an accept-reject mechanism in which the chain moves from time t to time t + 1 by first proposing a candidate value y from a density $q_\theta(y|x_t)$ and then either accepting it with some probability $\alpha(x_t, y)$ and setting $x_{t+1} = y$, or rejecting it and setting $x_{t+1} = x_t$. Since θ directly affects this acceptance probability, it is clear that one should choose θ so that the chain neither moves too slowly nor rejects too many proposed values y, because in both cases convergence to the stationary distribution will be slow. This has been recognised as early as [22] and has initiated exciting research that has produced optimal average acceptance probabilities for a series of algorithms; see [30, 31, 32, 15, 6, 8, 34, 7, 35, 9]. Such optimal average acceptance probabilities provide basic guidelines for adapting single step size parameters to achieve certain average acceptance rates. More sophisticated adaptive MCMC algorithms that can learn a full set of parameters θ, such as a covariance matrix, borrow information from the history of the chain to optimise some criterion reflecting the performance of the Markov chain [14, 5, 33, 13, 2, 1, 4]. Such methods are typically
non gradient-based, and the basic strategy they use is to sequentially fit the proposal $q_\theta(y|x)$ to the history of states $x_{t-1}, x_t, \ldots$, thereby also ignoring the rejected state values. This can result in very slow adaptation because the initial Markov chain simulations are based on a poor initial θ, and the generated states, from which θ is learnt, are highly correlated and far from the target. The authors in [34] call such adaptive strategies greedy in the sense that they try to adapt too closely to initial information from the output and take considerable time to recover from misleading initial information.

In this paper, we develop faster and more robust gradient-based adaptive MCMC algorithms that make use of the gradient of the target, $\nabla \log \pi(x)$, and learn from both the actual states of the chain and the proposed (and possibly rejected) states. The key idea is to define and maximise w.r.t. θ an entropy regularised objective function that promotes high acceptance rates and high values for the entropy of the proposal distribution. This objective function, referred to as generalised speed measure, is inspired by the speed measure of the infinite-dimensional limiting diffusion process that captures the notion of speed at which a Markov chain converges to its stationary distribution [32]. We maximise this objective function by applying stochastic gradient variational inference techniques such as those based on the reparametrisation trick [19, 29, 40]. An advantage of our algorithm compared to traditional adaptive MCMC methods is that the adaptation occurs even when candidate state values are rejected. In fact, the adaptation can be faster when candidate values y are rejected, since then we always make full use of the gradient $\nabla \log \pi(y)$ evaluated at the rejected y. This allows the adaptation to start in early iterations even if the initial proposal distribution is far from optimum and the chain is not moving. We apply the method to learning multivariate RWM and MALA proposals, where we adapt full covariance matrices parametrised efficiently using Cholesky factors. In the experiments we apply our algorithms to multivariate Gaussian targets and Bayesian logistic regression, and empirically show that they outperform several other baselines, including advanced HMC schemes.

2 Gradient-based adaptive MCMC

Assume a target distribution π(x), known up to a normalising constant, where $x \in \mathbb{R}^n$ is the state vector. To sample from π(x) we consider the Metropolis-Hastings (M-H) algorithm that generates new states by sampling from a proposal distribution $q_\theta(y|x)$, which depends on parameters θ, and accepts or rejects each proposed state by using the standard M-H acceptance probability

$$
\alpha(x, y; \theta) = \min\left\{1, \frac{\pi(y)\, q_\theta(x|y)}{\pi(x)\, q_\theta(y|x)}\right\}.
$$

While the M-H algorithm defines a Markov chain that converges to the target distribution, the efficiency of the algorithm heavily depends on the choice of the proposal distribution $q_\theta(y|x)$ and the setting of its parameters θ. Here, we develop a framework for stochastic gradient-based adaptation or learning of $q_\theta(y|x)$ that maximises an objective function inspired by the concept of speed measure, which underlies the theoretical foundations of MCMC optimal tuning [30, 31]. Given that the chain is at state x we would like: (i) to propose big jumps in the state space and (ii) to accept these jumps with high probability.
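For reference, a single M-H transition with a parametrised proposal can be sketched as follows. This is only an illustrative sketch written with JAX (not the authors' MATLAB implementation), where `log_pi`, `sample_q` and `log_q` are assumed user-supplied placeholders:

```python
import jax
import jax.numpy as jnp

def mh_step(key, x, log_pi, sample_q, log_q):
    """One Metropolis-Hastings transition from state x.

    Assumed placeholders: log_pi(x) is the unnormalised log target,
    sample_q(key, x) draws y ~ q_theta(.|x), log_q(y, x) evaluates log q_theta(y|x).
    """
    key_prop, key_acc = jax.random.split(key)
    y = sample_q(key_prop, x)                                   # propose a candidate state
    log_alpha = jnp.minimum(0.0, log_pi(y) - log_pi(x)
                            + log_q(x, y) - log_q(y, x))        # log of the M-H acceptance probability
    accept = jnp.log(jax.random.uniform(key_acc)) < log_alpha   # accept y with probability alpha
    x_new = jnp.where(accept, y, x)
    return x_new, y, accept    # keep the proposed y: the adaptation below also learns from rejected proposals
```

Keeping the proposed state y, whether it is accepted or not, is what the adaptation scheme described next exploits.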
By assuming for now that the proposal has the standard random walk isotropic form, $q_\sigma(y|x) = N(y|x, \sigma^2 I)$, the speed measure is defined as

$$
s_\sigma(x) = \sigma^2\, \alpha(x; \sigma), \qquad (2)
$$

where σ² denotes the variance, also called the step size, of the proposal distribution, while α(x; σ) is the average acceptance probability when starting at x, i.e. $\alpha(x; \sigma) = \int \alpha(x, y; \sigma)\, q_\sigma(y|x)\, dy$. To learn a good value for the step size we could maximise the speed measure in Eq. 2, which intuitively promotes high variance for the proposal distribution together with high acceptance rates. In the theory of optimal MCMC tuning, $s_\sigma(x)$ is averaged under the stationary distribution π(x) to obtain a global speed measure value $s_\sigma = \int \pi(x)\, s_\sigma(x)\, dx$. For simple targets and with increasing dimension, this measure is maximised when σ² is set to a value that leads to the acceptance probability 0.234 [30, 31]. This subsequently leads to the popular heuristic for tuning random walk proposals: tune σ² so that on average the proposed states are accepted with probability 1/4. Similar heuristics have been obtained for tuning the step sizes of more advanced schemes such as MALA and HMC, where 0.574 is considered optimal for MALA [32] and 0.651 for HMC [24, 9].

While the current notion of speed measure from Eq. 2 is intuitive, it is only suitable for tuning proposals having a single step size. Thus, in order to learn arbitrary proposal distributions $q_\theta(y|x)$, where θ is a vector of parameters, we need to define suitable generalisations of the speed measure. Suppose, for instance, that we wish to tune a Gaussian with a full covariance matrix, i.e. $q_\Sigma(y|x) = N(y|x, \Sigma)$. To achieve this we need to generalise the objective in Eq. 2 by replacing σ² with some functional F(Σ) that depends on the full covariance. An obvious choice is to consider the average squared distance $\|y - x\|^2$, given by the trace $\mathrm{tr}(\Sigma) = \sum_i \sigma_i^2$. However, this is problematic since it can lead to learning proposals with very poor mixing. To see this, note that since the trace is the sum of variances it can obtain high values even when some of the components of x have very low variance, e.g. for some $x_i$ it holds that $\sigma_i^2 \approx 0$, which can result in very low sampling efficiency or even non-ergodicity. In order to define better functionals F(Σ) we wish to exploit the intuition that for MCMC all components of x need to jointly perform (relative to their scale) big jumps, a requirement that is better captured by the determinant |Σ| or, more generally, by the entropy of the proposal distribution. Therefore, we introduce a generalisation of the speed measure having the form

$$
s_\theta(x) = \exp\{\beta \mathcal{H}_{q_\theta(y|x)}\}\, \alpha(x; \theta) = \exp\{\beta \mathcal{H}_{q_\theta(y|x)}\} \int \alpha(x, y; \theta)\, q_\theta(y|x)\, dy, \qquad (3)
$$

where $\mathcal{H}_{q_\theta(y|x)} = -\int q_\theta(y|x) \log q_\theta(y|x)\, dy$ denotes the entropy of the proposal distribution and β > 0 is a hyperparameter. Note that when the proposal distribution is a full Gaussian $q_\Sigma(y|x) = N(y|x, \Sigma)$, then $\exp\{\beta \mathcal{H}_{q(y|x)}\} = \text{const} \times |\Sigma|^{\beta/2}$, which depends on the determinant of Σ. $s_\theta(x)$, referred to as the generalised speed measure, trades off between high entropy of the proposal distribution and high acceptance probability. The hyperparameter β plays the crucial role of balancing the relative strengths of these two terms. As discussed in the next section, we can efficiently adapt β in order to achieve a desirable average acceptance rate. In the following we make use of the above generalised speed measure to derive a variational learning algorithm for adapting the parameters θ using stochastic gradient-based optimisation.
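As a concrete illustration of Eq. 3 (our sketch, not part of the adaptation algorithm itself), the following estimates $\log s_\theta(x)$ by Monte Carlo for the full-covariance random walk proposal $N(y|x, LL^\top)$; `log_pi` is an assumed user-supplied log target:

```python
import jax
import jax.numpy as jnp

def log_speed_measure(key, x, L, log_pi, beta, num_samples=500):
    """Monte Carlo estimate of log s_theta(x) in Eq. 3 for q(y|x) = N(y|x, L L^T)."""
    n = x.shape[0]
    eps = jax.random.normal(key, (num_samples, n))
    ys = x + eps @ L.T                               # y = x + L eps, i.e. draws from the proposal
    log_alpha = jnp.minimum(0.0, jax.vmap(log_pi)(ys) - log_pi(x))   # symmetric proposal: q terms cancel
    avg_accept = jnp.mean(jnp.exp(log_alpha))        # estimate of the average acceptance probability
    entropy = 0.5 * n * jnp.log(2 * jnp.pi * jnp.e) + jnp.sum(jnp.log(jnp.diag(L)))  # entropy of N(., LL^T)
    return beta * entropy + jnp.log(avg_accept)
```

Maximising this quantity directly is impractical because of the inner average over proposals; the next subsection instead maximises a lower bound using the single proposed sample generated at each MCMC iteration.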
2.1 Maximising the generalised speed measure using variational inference

During the MCMC iterations we collect the pairs of vectors $(x_t, y_t)_{t > 0}$, where $x_t$ is the state of the chain at time t and $y_t$ the corresponding proposed next state, so that $x_{t+1} = y_t$ if $y_t$ is accepted. When the chain has converged each $x_t$ follows the stationary distribution π(x); otherwise it follows some distribution that progressively converges to π(x). In either case we view the sequence of pairs $(x_t, y_t)$ as non-iid data based on which we wish to perform gradient-based updates of the parameters θ. In practice such updates can be performed with diminishing learning rates or, more safely, stopped completely after some number of burn-in iterations to ensure convergence. Specifically, given the current state $x_t$ we wish to take a step towards maximising $s_\theta(x_t)$ in Eq. 3, or equivalently its logarithm,

$$
\log s_\theta(x_t) = \log \int \alpha(x_t, y; \theta)\, q_\theta(y|x_t)\, dy + \beta \mathcal{H}_{q_\theta(y|x_t)}. \qquad (4)
$$

The second term is just the entropy of the proposal distribution, which typically will be analytically tractable, while the first term involves an intractable integral. To approximate the first term we work similarly to variational inference [18, 10] and lower bound it using Jensen's inequality,

$$
\log s_\theta(x_t) \geq F_\theta(x_t) = \int q_\theta(y|x_t) \log \min\left\{1, \frac{\pi(y)\, q_\theta(x_t|y)}{\pi(x_t)\, q_\theta(y|x_t)}\right\} dy + \beta \mathcal{H}_{q_\theta(y|x_t)} \qquad (5)
$$

$$
= \int q_\theta(y|x_t) \min\left\{0, \log \frac{\pi(y)}{\pi(x_t)} + \log \frac{q_\theta(x_t|y)}{q_\theta(y|x_t)}\right\} dy + \beta \mathcal{H}_{q_\theta(y|x_t)}. \qquad (6)
$$

To take a step towards maximising $F_\theta$ we can apply standard stochastic variational inference techniques such as the score function method or the reparametrisation trick [11, 26, 28, 19, 29, 40, 20]. Here, we focus on the case where $q_\theta(y|x_t)$ is a reparametrisable distribution such that $y = T_\theta(x_t, \epsilon)$, where $T_\theta$ is a deterministic transformation and $\epsilon \sim p(\epsilon)$. Then $F_\theta(x_t)$ can be re-written as

$$
F_\theta(x_t) = \int p(\epsilon) \min\left\{0, \log \frac{\pi(T_\theta(x_t, \epsilon))}{\pi(x_t)} + \log \frac{q_\theta(x_t|T_\theta(x_t, \epsilon))}{q_\theta(T_\theta(x_t, \epsilon)|x_t)}\right\} d\epsilon + \beta \mathcal{H}_{q_\theta(y|x_t)}.
$$

Since MCMC at the t-th iteration proposes a specific state $y_t$, constructed as $\epsilon_t \sim p(\epsilon)$, $y_t = T_\theta(x_t, \epsilon_t)$, an unbiased estimate of the exact gradient $\nabla_\theta F_\theta(x_t)$ can be obtained by

$$
\nabla_\theta F_\theta(x_t, \epsilon_t) = \nabla_\theta \min\left\{0, \log \frac{\pi(T_\theta(x_t, \epsilon_t))}{\pi(x_t)} + \log \frac{q_\theta(x_t|T_\theta(x_t, \epsilon_t))}{q_\theta(T_\theta(x_t, \epsilon_t)|x_t)}\right\} + \beta \nabla_\theta \mathcal{H}_{q_\theta(y|x_t)},
$$

which is used to make a gradient update for the parameters θ. Note that the first term in the stochastic gradient is analogous to differentiating through a rectified linear unit (ReLU) in neural networks: if $\log \frac{\pi(y_t)}{\pi(x_t)} + \log \frac{q_\theta(x_t|y_t)}{q_\theta(y_t|x_t)} \geq 0$ the gradient is zero (this is the case when $y_t$ is accepted with probability one), while otherwise the gradient of the first term reduces to $\nabla_\theta \log \pi(T_\theta(x_t, \epsilon_t)) + \nabla_\theta \log \frac{q_\theta(x_t|T_\theta(x_t, \epsilon_t))}{q_\theta(T_\theta(x_t, \epsilon_t)|x_t)}$.
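To make this concrete, here is a minimal JAX sketch (our illustration, not the authors' implementation) of the single-sample objective whose gradient with respect to θ is exactly the reparametrisation estimator above; `log_pi`, `transform`, `log_q` and `entropy` are assumed user-supplied functions:

```python
import jax
import jax.numpy as jnp

def per_sample_bound(theta, x, eps, beta, log_pi, transform, log_q, entropy):
    """Single-sample estimate of the lower bound F_theta(x) in Eqs. 5-6.

    Assumed placeholders: transform(theta, x, eps) is the reparametrisation T_theta(x, eps),
    log_q(theta, y, x) is log q_theta(y|x) and entropy(theta) is the proposal entropy.
    """
    y = transform(theta, x, eps)
    log_mh_ratio = (log_pi(y) - log_pi(x)
                    + log_q(theta, x, y) - log_q(theta, y, x))
    return jnp.minimum(0.0, log_mh_ratio) + beta * entropy(theta)

# Reparametrisation gradient used to adapt theta at iteration t:
# grad_theta = jax.grad(per_sample_bound)(theta, x_t, eps_t, beta, log_pi, transform, log_q, entropy)
```

Differentiating through `jnp.minimum(0, ·)` reproduces the ReLU-like behaviour described above: whenever the proposed state would be accepted with probability one, the gradient of the first term vanishes and only the entropy term adapts θ.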
The hyperparameter β trades off between large acceptance probability and large entropy of the proposal distribution. It cannot be optimised by maximising the variational objective $F_\theta$ (this would typically set β to a very small value so that the acceptance probability becomes close to one but the chain barely moves, since the entropy is very low). Thus, β needs to be updated so as to control the average acceptance probability of the chain and achieve a certain desired value α*. The value of α* can be determined based on the specific MCMC proposal we are using and by following standard recommendations in the literature, as discussed in the previous section. For instance, when we use RWM, α* can be set to 1/4 (see Section 2.2), while for gradient-based MALA (see Section 2.3) α* can be set to 0.55. Pseudocode for the general procedure is given by Algorithm 1.

Algorithm 1: Gradient-based Adaptive MCMC

Input: target π(x); reparametrisable proposal $q_\theta(y|x)$ such that $y = T_\theta(x, \epsilon)$, $\epsilon \sim p(\epsilon)$; initial state $x_0$; desired average acceptance probability α*.
Initialise θ; set β = 1.
For t = 1, 2, 3, . . . :
  1. Propose $\epsilon_t \sim p(\epsilon)$, $y_t = T_\theta(x_t, \epsilon_t)$.
  2. Adapt θ: $\theta \leftarrow \theta + \rho_t \nabla_\theta F_\theta(x_t, \epsilon_t)$.
  3. Accept or reject $y_t$ using the standard M-H ratio to obtain $x_{t+1}$.
  4. Set $\alpha_t = 1$ if $y_t$ was accepted and $\alpha_t = 0$ otherwise.
  5. Adapt the hyperparameter β: $\beta \leftarrow \beta\,[1 + \rho_\beta(\alpha_t - \alpha^*)]$ (default value $\rho_\beta = 0.02$).

We set the learning rate $\rho_t$ using RMSProp [39]: at each iteration t we set $\rho_t = \eta / (1 + \sqrt{G_t})$, where η is the baseline learning rate, and the running average $G_t$ is updated from the gradient estimate as $G_t = 0.9\, G_{t-1} + 0.1\, [\nabla_\theta F_\theta(x_t, \epsilon_t)]^2$ (elementwise).

2.2 Fitting a full covariance Gaussian random walk proposal

We now specialise to the case where the proposal distribution is a random walk Gaussian $q_L(y|x) = N(y|x, LL^\top)$, where the parameter L is a positive definite lower triangular matrix, i.e. a Cholesky factor. This distribution is reparametrisable since $y = T_L(x, \epsilon) = x + L\epsilon$, $\epsilon \sim N(\epsilon|0, I)$. At the t-th iteration when the state is $x_t$ the lower bound becomes

$$
F_L(x_t) = \int N(\epsilon|0, I) \min\{0, \log \pi(x_t + L\epsilon) - \log \pi(x_t)\}\, d\epsilon + \beta \sum_{i=1}^n \log L_{ii} + \text{const}. \qquad (7)
$$

Here, the proposal distribution has cancelled out from the M-H ratio, since it is symmetric, while $\mathcal{H}_{q_\theta(y|x_t)} = \log |L| + \text{const}$ and $\log |L| = \sum_{i=1}^n \log L_{ii}$. By making use of the MCMC proposed state $y_t = x_t + L\epsilon_t$ we can obtain an unbiased estimate of the exact gradient $\nabla_L F_L(x_t)$,

$$
\nabla_L F_L(x_t, \epsilon_t) =
\begin{cases}
\left[\nabla_{y_t} \log \pi(y_t)\, \epsilon_t^\top\right]_{\text{lower}} + \beta\, \mathrm{diag}\!\left(\tfrac{1}{L_{11}}, \ldots, \tfrac{1}{L_{nn}}\right), & \text{if } \log \pi(y_t) < \log \pi(x_t), \\[4pt]
\beta\, \mathrm{diag}\!\left(\tfrac{1}{L_{11}}, \ldots, \tfrac{1}{L_{nn}}\right), & \text{otherwise},
\end{cases}
$$

where $y_t = x_t + L\epsilon_t$, the operation $[A]_{\text{lower}}$ zeros the upper triangular part (above the main diagonal) of a square matrix, and $\mathrm{diag}(\tfrac{1}{L_{11}}, \ldots, \tfrac{1}{L_{nn}})$ is a diagonal matrix with elements $1/L_{ii}$. The first case of this gradient, i.e. when $\log \pi(y_t) < \log \pi(x_t)$, has a very similar structure to the stochastic reparametrisation gradient obtained when fitting a variational Gaussian approximation [19, 29, 40], with the difference that here we centre the corresponding approximation, i.e. the proposal $q_L(y_t|x_t)$, at the current state $x_t$ instead of having a global variational mean parameter. Interestingly, this first case, when MCMC rejects many samples (or even gets stuck at the same value for a long time), is when learning can be fastest, since the term $\nabla_{y_t} \log \pi(y_t)\, \epsilon_t^\top$ transfers information about the curvature of the target to the covariance of the proposal. When we start getting high acceptance rates the second case, i.e. $\log \pi(y_t) \geq \log \pi(x_t)$, will often be activated, so that the gradient will often reduce to only the term $\beta\, \mathrm{diag}(\tfrac{1}{L_{11}}, \ldots, \tfrac{1}{L_{nn}})$ that encourages the entropy of the proposal to become large.

The ability to learn from rejections is in sharp contrast with traditional non gradient-based adaptive MCMC methods, which can become very slow when MCMC has high rejection rates. This is because these methods typically learn from the history of state vectors $x_t$, ignoring the information from the rejected states. The algorithm for learning the full random walk Gaussian follows precisely the general structure of Algorithm 1. For the target average acceptance rate α* we use the value 1/4.
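The following JAX sketch (a minimal illustration under assumed function names, not the authors' MATLAB code) combines the closed-form gradient above with the RMSProp step sizes of Section 2.1 and the β update of Algorithm 1, giving one full adaptation step for the random walk proposal:

```python
import jax
import jax.numpy as jnp

def gadrwm_step(key, x, L, G, beta, log_pi, grad_log_pi,
                eta=5e-5, rho_beta=0.02, alpha_star=0.25):
    """One iteration of Algorithm 1 for the full-covariance random walk proposal (Section 2.2)."""
    n = x.shape[0]
    key_eps, key_u = jax.random.split(key)
    eps = jax.random.normal(key_eps, (n,))
    y = x + L @ eps                                    # reparametrised proposal y = x + L eps
    log_ratio = log_pi(y) - log_pi(x)                  # symmetric proposal: q terms cancel

    # Closed-form gradient of F_L(x_t, eps_t) with respect to L
    grad_L = beta * jnp.diag(1.0 / jnp.diag(L))
    grad_L = jnp.where(log_ratio < 0.0,
                       grad_L + jnp.tril(jnp.outer(grad_log_pi(y), eps)),
                       grad_L)

    # RMSProp update of L (Section 2.1)
    G = 0.9 * G + 0.1 * grad_L ** 2
    L = L + eta / (1.0 + jnp.sqrt(G)) * grad_L

    # Standard M-H accept/reject step and the beta update of Algorithm 1
    accept = jnp.log(jax.random.uniform(key_u)) < log_ratio
    x_new = jnp.where(accept, y, x)
    beta = beta * (1.0 + rho_beta * (jnp.where(accept, 1.0, 0.0) - alpha_star))
    return x_new, L, G, beta
```

Here x, the lower triangular L (with positive diagonal), the RMSProp accumulator G (same shape as L, initialised at zero) and β are carried across iterations, and the adaptation is switched off after the burn-in phase as described above.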
2.3 Fitting a full covariance MALA proposal

Here, we specialise to a full covariance, also called preconditioned, MALA proposal of the form $q_L(y|x) = N\big(y \,\big|\, x + \tfrac{1}{2} LL^\top \nabla_x \log \pi(x),\; LL^\top\big)$, where the covariance matrix is parametrised by the Cholesky factor L. Again this distribution is reparametrisable, according to $y = T_L(x, \epsilon) = x + \tfrac{1}{2} LL^\top \nabla \log \pi(x) + L\epsilon$, $\epsilon \sim N(\epsilon|0, I)$. At the t-th iteration when the state is $x_t$ the reparametrised lower bound simplifies significantly and reduces to

$$
F_L(x_t) = \int N(\epsilon|0, I) \min\Big\{0,\; \log \pi\big(x_t + \tfrac{1}{2} LL^\top \nabla \log \pi(x_t) + L\epsilon\big) - \log \pi(x_t) - \tfrac{1}{2}\big\|\tfrac{1}{2} L^\top [\nabla \log \pi(x_t) + \nabla \log \pi(y)] + \epsilon\big\|^2 + \tfrac{1}{2}\|\epsilon\|^2 \Big\}\, d\epsilon + \beta \sum_{i=1}^n \log L_{ii} + \text{const},
$$

where ‖·‖ denotes the Euclidean norm and, inside the term $\nabla \log \pi(y)$, $y = x_t + \tfrac{1}{2} LL^\top \nabla \log \pi(x_t) + L\epsilon$. Then, based on the proposed state $y_t = T_L(x_t, \epsilon_t)$, we can obtain the unbiased gradient estimate $\nabla_L F_L(x_t, \epsilon_t)$ similarly to the previous section. In general, such an estimate can be very expensive because the presence of L inside $\nabla \log \pi(y_t)$ means that we need to compute the matrix of second derivatives, or Hessian, $\nabla \nabla \log \pi(y_t)$. We have found that an alternative procedure which stops the gradient inside $\nabla \log \pi(y_t)$ (i.e. views $\nabla \log \pi(y_t)$ as a constant w.r.t. L) has small bias and works well in practice. In fact, as we show in the experiments, this approximation not only is computationally much faster but, remarkably, also leads to better adaptation than the exact Hessian procedure, presumably because not accounting for the gradient inside $\nabla \log \pi(y_t)$ reduces the variance. Furthermore, the expression of the gradient w.r.t. L used by this fast approximation can be computed very efficiently with a single O(n²) operation (an outer vector product; see Supplement), while each iteration of the algorithm requires overall at most four O(n²) operations. For these gradient-based adaptive MALA schemes, β in Algorithm 1 is adapted to obtain an average acceptance rate of roughly α* = 0.55.
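The closed-form gradient of the fast variant is given in the Supplement; as an illustration of the stop-gradient idea under assumed function names (our sketch, not the authors' derivation), the JAX code below autodifferentiates the per-sample bound while treating $\nabla \log \pi(y_t)$ as a constant, so that no Hessian evaluations are needed. Removing the `stop_gradient` recovers the exact, Hessian-based estimator.

```python
import jax
import jax.numpy as jnp

def gadmala_bound(L, x, eps, beta, log_pi):
    """Per-sample lower bound F_L(x_t, eps_t) for the preconditioned MALA proposal (Section 2.3)."""
    grad_log_pi = jax.grad(log_pi)
    g_x = grad_log_pi(x)
    y = x + 0.5 * L @ (L.T @ g_x) + L @ eps                 # reparametrised MALA proposal
    g_y = jax.lax.stop_gradient(grad_log_pi(y))             # fast variant: grad log pi(y) held constant w.r.t. L
    log_ratio = (log_pi(y) - log_pi(x)
                 - 0.5 * jnp.sum((0.5 * L.T @ (g_x + g_y) + eps) ** 2)
                 + 0.5 * jnp.sum(eps ** 2))
    return jnp.minimum(0.0, log_ratio) + beta * jnp.sum(jnp.log(jnp.diag(L)))

# One adaptation gradient, masked to the lower triangle of L:
# grad_L = jnp.tril(jax.grad(gadmala_bound)(L, x, eps, beta, log_pi))
```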
3 Related Work

The connection of our method with traditional adaptive MCMC methods has been discussed in Section 1. Here, we discuss additional related works that make use of gradient-based optimisation and specialised objective functions or algorithms to train MCMC proposal distributions.

The work in [21] proposed a criterion to tune MCMC proposals based on maximising a modified version of the expected squared jumped distance, $\int q_\theta(y|x_t)\, \|y - x_t\|^2\, \alpha(x_t, y; \theta)\, dy$, previously considered in [27]. Specifically, the authors in [21] first observe that the expected squared jumped distance may not encourage mixing across all dimensions of x (because the additive form $\|y - x_t\|^2 = \sum_i (y_i - x_{t,i})^2$ implies that even when some dimensions are not moving at all, so that the corresponding distance terms are zero, the overall sum can still be large) and then try to resolve this by including a reciprocal term (see Section 4.2 in their paper). The generalised speed measure proposed in this paper is rather different from such criteria since it encourages joint exploration of all dimensions of x by applying maximum entropy regularisation, which by construction penalises "dimensions that do not move", since the entropy becomes minus infinity in such cases. Another important difference is that in our method the optimisation is performed in the log space by propagating gradients through the logarithm of the M-H acceptance probability, i.e. through log α(x_t, y; θ) and not through α(x_t, y; θ). This is exactly analogous to other numerically stable objectives such as variational lower bounds and log likelihoods, and like those, our method leads to numerically stable optimisation for arbitrarily large dimensionality of x and complex targets π(x).

In another related work, the authors in [25] considered minimising the KL divergence $\mathrm{KL}[\pi(x_t)\, q_\theta(y_t|x_t)\, \|\, \pi(y_t)\, q_\theta(x_t|y_t)]$. However, for standard proposal schemes, such as RWM and MALA, this loss leads to degenerate deterministic solutions where $q_\theta(y_t|x_t)$ collapses to a delta function. Therefore, [25] optimised this objective for the independent M-H sampler, where the collapsing problem does not occur. The entropy regularised objective we introduce is different and can adapt arbitrary MCMC proposal distributions, not just the independent M-H sampler.

There has also been work on learning flexible MCMC proposals using neural networks [38, 21, 16, 36]. For instance, [38] use volume preserving flows and an adversarial objective, [21] use the modified expected squared jumped distance, discussed earlier, to learn neural network-based extensions of HMC, while [16, 36] use auxiliary variational inference. The need to train neural networks can add a significant computational cost, and from the end-user point of view these neural adaptive samplers might be hard to tune, especially in high dimensions. Notice that the generalised speed measure we propose in this paper could possibly be used to train neural adaptive samplers as well. However, to really obtain practical algorithms we need to ensure that training has a small cost that does not overwhelm the possible benefits in terms of effective sample size.

Finally, the generalised speed measure, being based on entropy regularisation, shares similarities with other objectives used for learning probability distributions, such as in variational Bayesian inference, where the variational lower bound includes an entropy term [18, 10], and in reinforcement learning (RL), where maximum-entropy regularised policy gradients are able to estimate more explorative policies [37, 23]. Further discussion on the resemblance of our algorithm with RL is given in the Supplement.

4 Experiments

We test the gradient-based adaptive MCMC methods on several simulated and real datasets. We investigate the performance of two instances of the framework: the gradient-based adaptive random walk (gadRWM) detailed in Section 2.2 and the corresponding MALA (gadMALA) detailed in Section 2.3. For gadMALA we consider the exact reparametrisation method that requires evaluation of the Hessian (gadMALAe) and the fast approximate variant (gadMALAf). These schemes are compared against: (i) standard random walk Metropolis (RWM) with proposal $N(y|x, \sigma^2 I)$; (ii) an adaptive MCMC scheme (AM) that fits a proposal of the form $N(y|x, \Sigma)$ (we consider a computationally efficient version based on updating the Cholesky factor of Σ; see Supplement); (iii) a standard MALA proposal $N(y|x + \tfrac{1}{2}\sigma^2 \nabla \log \pi(x), \sigma^2 I)$; (iv) a Hamiltonian Monte Carlo (HMC) sampler with a fixed number of leapfrog steps (either 5, 10 or 20); and (v) the state-of-the-art no-U-turn sampler (NUTS) [17], which arguably is the most efficient adaptive HMC method and automatically determines the number of leapfrog steps. We provide our own MATLAB implementation of all algorithms (https://github.com/mtitsias/gadMCMC), apart from NUTS, for which we consider a publicly available implementation.

4.1 Illustrative experiments

To visually illustrate the gradient-based adaptive samplers we consider a correlated 2-D Gaussian target with covariance matrix Σ = [1 0.99; 0.99 1] and a 51-dimensional Gaussian target obtained by evaluating the squared exponential kernel plus small white noise, i.e. $k(x_i, x_j) = \exp\{-(x_i - x_j)^2 / 0.16\} + 0.01\, \delta_{i,j}$, on a regular grid over [0, 4].
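For completeness, here is a small sketch of how such a Gaussian target and the gradient of its log density could be set up; the lengthscale below is our reading of the kernel stated above, so treat it as an assumption:

```python
import jax
import jax.numpy as jnp

def make_se_gaussian_target(num_points=51, noise=0.01):
    """Zero-mean Gaussian target with a squared exponential covariance on a regular grid over [0, 4]."""
    grid = jnp.linspace(0.0, 4.0, num_points)
    diffs = grid[:, None] - grid[None, :]
    cov = jnp.exp(-diffs ** 2 / 0.16) + noise * jnp.eye(num_points)   # assumed kernel hyperparameters
    prec = jnp.linalg.inv(cov)

    def log_pi(x):
        return -0.5 * x @ prec @ x          # log density up to an additive constant
    return log_pi, cov

log_pi, cov = make_se_gaussian_target()
grad_log_pi = jax.grad(log_pi)              # gradient needed by the gadMALA schemes
```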
The first two panels in Figure 1 show the true covariance together with the adapted covariances obtained by gadRWM for two different settings of the average acceptance rate α* in Algorithm 1, which also illustrates the adaptation of the entropy-regularisation hyperparameter β that is learnt in order to achieve a given α*. The remaining two plots illustrate the ability to learn a highly correlated 51-dimensional covariance matrix (with eigenvalues ranging from 0.01 to 12.07) by applying our most advanced gadMALAf scheme.

Figure 1: The green contours in the first two panels (from left to right) show the 2-D Gaussian target, while the blue contours show the learned covariance, $LL^\top$, after adapting for $2 \times 10^4$ iterations using gadRWM and targeting acceptance rates α* = 0.25 and α* = 0.4, respectively. For α* = 0.25 the adapted blue contours show that the proposal matches the shape of the target but has higher entropy/variance, and the hyperparameter β obtained the value 7.4. For α* = 0.4 the blue contours shrink a bit and β is reduced to 2.2 (since a higher acceptance rate requires smaller entropy). The third panel shows the exact 51 × 51 covariance matrix and the last panel shows the adapted one, after running our most efficient gadMALAf scheme for $2 \times 10^5$ iterations. In both experiments L was initialised to a diagonal matrix with $0.1/\sqrt{n}$ on the diagonal.

4.2 Quantitative results

Here, we compare all algorithms on some standard benchmark problems, such as Bayesian logistic regression, and report effective sample size (ESS) together with other quantitative scores.

Experimental settings. In all experiments, for AM and the gradient-based adaptive schemes the Cholesky factor L was initialised to a diagonal matrix with values $0.1/\sqrt{n}$ on the diagonal, where n is the dimensionality of x. For the AM algorithm the learning rate was set to 0.001/(1 + t/T), where t is the number of iterations and T (the value 4000 was used in all experiments) serves to keep the learning rate nearly constant for the first T training iterations. For the gradient-based adaptive algorithms we use RMSProp (see Section 2.1), where η was set to 0.00005 for gadRWM and to 0.00015 for the gadMALA schemes. NUTS uses its own fully automatic adaptive procedure that determines both the step size and the number of leapfrog steps [17]. For all experiments and algorithms (apart from NUTS) we consider $2 \times 10^4$ burn-in iterations and $2 \times 10^4$ iterations for collecting samples. The adaptation of L or σ² takes place only during the burn-in iterations and then it stops, i.e. during the sample-collection stage these parameters are kept fixed. For NUTS, which has its own internal tuning procedure, 500 burn-in iterations are sufficient before collecting $2 \times 10^4$ samples.

Figure 2: Panels in the first row show trace plots, obtained by different schemes, across the last $2 \times 10^4$ sampling iterations for the most difficult to sample dimension $x_{100}$. The panels in the second row show the estimated values of the diagonal of L obtained by the different adaptive schemes. Notice that the real Gaussian target has diagonal covariance matrix $\Sigma = \mathrm{diag}(s_1^2, \ldots, s_{100}^2)$, where the $s_i$ are uniformly spaced in the range [0.01, 1].
The computational time for all algorithms reported in the tables corresponds to the overall running time, i.e. the time for performing all burn-in and sample-collection iterations jointly.

Neal's Gaussian target. We first consider an example used in [24] where the target is a zero-mean multivariate Gaussian with diagonal covariance matrix $\Sigma = \mathrm{diag}(s_1^2, \ldots, s_{100}^2)$, where the standard deviations $s_i$ take values on the linear grid 0.01, 0.02, . . . , 1. This is a challenging example because the different scaling of the dimensions means that schemes using an isotropic step size σ² will be adapted to the smallest dimension $x_1$, while the chain at the higher dimensions, such as $x_{100}$, will be moving slowly, exhibiting high autocorrelation and small effective sample size. The first row of Figure 2 shows the trace plot across iterations of the dimension $x_{100}$ for some of the adaptive schemes, including an HMC scheme that uses 20 leapfrog steps. Clearly, the gradient-based adaptive methods show much smaller autocorrelation than AM. This is because they achieve a more efficient adaptation of the Cholesky factor L, which ideally should become proportional to a diagonal matrix with the linear grid 0.01, 0.02, . . . , 1 on the main diagonal. The second row of Figure 2 shows the diagonal elements of L, from which we can observe that all gradient-based schemes lead to more accurate adaptation, with gadMALAf being the most accurate. Furthermore, Table 1 provides quantitative results such as minimum, median and maximum ESS computed across all dimensions of the state vector x, running times, and an overall efficiency score, Min ESS/s (i.e. the ESS of the slowest moving component of x divided by the running time; last column of the table), which allows us to rank the different algorithms. All results are averages after repeating the simulations 10 times under different random initialisations. From the table it is clear that the gadMALA algorithms give the best performance, with gadMALAf being overall the most effective.

Table 1: Comparison on Neal's Gaussian example (dimensionality n = 100; top panel) and the Caravan binary classification dataset, which consists of 5822 data points (dimensionality n = 87; bottom panel). All numbers are averages across ten repeats; one standard deviation is given for the Min ESS/s score. From the three HMC schemes we report only the best one in each case.

| Method | Time (s) | Accept Rate | ESS (Min, Med, Max) | Min ESS/s (1 st.d.) |
|---|---|---|---|---|
| (Neal's Gaussian) | | | | |
| gadMALAf | 8.7 | 0.556 | (1413.4, 1987.4, 2580.8) | 161.70 (15.07) |
| gadMALAe | 10.0 | 0.541 | (922.2, 2006.3, 2691.1) | 92.34 (7.11) |
| gadRWM | 7.0 | 0.254 | (27.5, 66.9, 126.9) | 3.95 (0.66) |
| AM | 2.3 | 0.257 | (8.7, 48.6, 829.1) | 3.71 (0.87) |
| RWM | 2.2 | 0.261 | (2.9, 8.4, 2547.6) | 1.31 (0.06) |
| MALA | 3.1 | 0.530 | (2.9, 10.0, 12489.2) | 0.95 (0.03) |
| HMC-20 | 49.7 | 0.694 | (306.1, 1537.8, 19732.4) | 6.17 (3.35) |
| NUTS | 360.5 | >0.7 | (18479.6, 20000.0, 20000.0) | 51.28 (1.64) |
| (Caravan) | | | | |
| gadMALAf | 23.1 | 0.621 | (228.1, 750.3, 1114.7) | 9.94 (2.64) |
| gadMALAe | 95.1 | 0.494 | (66.6, 508.3, 1442.7) | 0.70 (0.16) |
| gadRWM | 22.6 | 0.234 | (5.3, 34.3, 104.5) | 0.23 (0.06) |
| AM | 20.0 | 0.257 | (3.2, 11.8, 62.5) | 0.16 (0.01) |
| RWM | 15.3 | 0.242 | (3.0, 9.3, 52.5) | 0.20 (0.03) |
| MALA | 22.8 | 0.543 | (4.4, 28.3, 326.0) | 0.19 (0.05) |
| HMC-10 | 225.5 | 0.711 | (248.3, 2415.7, 19778.7) | 1.10 (0.12) |
| NUTS | 1412.1 | >0.7 | (7469.5, 20000.0, 20000.0) | 5.29 (0.38) |

Bayesian logistic regression.
We consider binary classification where, given a set of training examples $\{y_i, s_i\}_{i=1}^n$, we assume a logistic regression model with log-likelihood $\log p(y|w, s) = \sum_{i=1}^n y_i \log \sigma(s_i) + (1 - y_i) \log(1 - \sigma(s_i))$, where $\sigma(s_i) = 1/(1 + \exp(-w^\top s_i))$, $s_i$ is the input vector and w are the parameters. We place a Gaussian prior on w and we wish to sample from the posterior distribution over w. We considered six binary classification datasets (Australian Credit, Heart, Pima Indian, Ripley, German Credit and Caravan) with a number of examples ranging from n = 250 to n = 5822 and a dimensionality of the state/parameter vector ranging from 3 to 87. Table 1 shows results for the most challenging Caravan dataset, where the dimensionality of w is 87, while the remaining five tables are given in the Supplement. From all tables we observe that gadMALAf is the most effective and outperforms all other methods. While NUTS always has very high ESS, it is still outperformed by gadMALAf because of its high computational cost, i.e. NUTS might need a very large number of leapfrog steps (each requiring re-evaluation of the gradient of the log target) per iteration. Further results, including a higher 785-dimensional example on MNIST, are given in the Supplement.
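As a concrete target that can be plugged into the samplers above, here is a short sketch (our illustration; the unit-variance Gaussian prior is an assumed setting, not taken from the paper) of the log posterior and its gradient for this logistic regression model:

```python
import jax
import jax.numpy as jnp

def make_logreg_target(S, y, prior_var=1.0):
    """Unnormalised log posterior over the weights w for Bayesian logistic regression.

    S: (num_data, dim) matrix of inputs s_i; y: (num_data,) binary labels in {0, 1}.
    The Gaussian prior variance is an assumed setting.
    """
    def log_pi(w):
        logits = S @ w
        # Bernoulli log-likelihood, sum_i y_i log sigma(s_i) + (1 - y_i) log(1 - sigma(s_i)),
        # written in a numerically stable form
        log_lik = jnp.sum(y * logits - jnp.logaddexp(0.0, logits))
        log_prior = -0.5 * jnp.sum(w ** 2) / prior_var
        return log_lik + log_prior
    return log_pi

# log_pi = make_logreg_target(S, y); grad_log_pi = jax.grad(log_pi)
```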
5 Conclusion

We have presented a new framework for gradient-based adaptive MCMC that makes use of an entropy-regularised objective function that generalises the concept of speed measure. We have applied this method to learning RWM and MALA proposals with full covariance matrices. Some topics for future research are the following. Firstly, to deal with very high dimensional spaces it would be useful to consider low rank parametrisations of the covariance matrices in RWM and MALA proposals. Secondly, it would be interesting to investigate whether our method can be used to tune the so-called mass matrix in HMC samplers. However, in order for this to lead to practical and scalable algorithms we have to come up with schemes that avoid the computation of the Hessian, as we have successfully done for MALA. Finally, in order to reduce the variance of the stochastic gradients and speed up the adaptation further, especially in high dimensions, our framework could possibly be combined with parallel computing, as used for instance in deep reinforcement learning [12].

References

[1] Christophe Andrieu and Yves Atchade. On the efficiency of adaptive mcmc algorithms. Electronic Communications in Probability, 12:336-349, 2007.

[2] Christophe Andrieu and Éric Moulines. On the ergodicity properties of some adaptive mcmc algorithms. The Annals of Applied Probability, 16(3):1462-1505, 2006.

[3] Christophe Andrieu and Johannes Thoms. A tutorial on adaptive mcmc. Statistics and Computing, 18(4):343-373, December 2008.

[4] Yves Atchade, Gersende Fort, Eric Moulines, and Pierre Priouret. Adaptive markov chain monte carlo: theory and methods. Preprint, 2009.

[5] Yves F Atchadé and Jeffrey S Rosenthal. On adaptive markov chain monte carlo algorithms. Bernoulli, 11(5):815-828, 2005.

[6] Mylène Bédard. Weak convergence of metropolis algorithms for non-iid target distributions. The Annals of Applied Probability, pages 1222-1244, 2007.

[7] Mylène Bédard. Efficient sampling using metropolis algorithms: Applications of optimal scaling results. Journal of Computational and Graphical Statistics, 17(2):312-332, 2008.

[8] Mylène Bédard. Optimal acceptance rates for metropolis algorithms: Moving beyond 0.234. Stochastic Processes and their Applications, 118(12):2198-2222, 2008.

[9] Alexandros Beskos, Natesh Pillai, Gareth Roberts, Jesus-Maria Sanz-Serna, and Andrew Stuart. Optimal tuning of the hybrid monte carlo algorithm. Bernoulli, 19(5A):1501-1534, 2013.

[10] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

[11] P. Carbonetto, M. King, and F. Hamze. A stochastic approximation method for inference in probabilistic graphical models. In Advances in Neural Information Processing Systems, 2009.

[12] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1407-1416, Stockholmsmässan, Stockholm, Sweden, 10-15 Jul 2018. PMLR.

[13] Paolo Giordani and Robert Kohn. Adaptive independent metropolis-hastings by fast estimation of mixtures of normals. Journal of Computational and Graphical Statistics, 19(2):243-259, 2010.

[14] Heikki Haario, Eero Saksman, and Johanna Tamminen. An adaptive metropolis algorithm. Bernoulli, 7(2):223-242, 2001.

[15] Heikki Haario, Eero Saksman, and Johanna Tamminen. Componentwise adaptation for high dimensional mcmc. Computational Statistics, 20(2):265-273, 2005.

[16] Raza Habib and David Barber. Auxiliary variational mcmc. To appear at ICLR 2019.

[17] Matthew D Hoffman and Andrew Gelman. The no-u-turn sampler: adaptively setting path lengths in hamiltonian monte carlo. Journal of Machine Learning Research, 15(1):1593-1623, 2014.

[18] M. I. Jordan, editor. Learning in Graphical Models. MIT Press, Cambridge, MA, USA, 1999.

[19] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.

[20] A. Kucukelbir, D. Tran, R. Ranganath, A. Gelman, and D. M. Blei. Automatic differentiation variational inference. Journal of Machine Learning Research, 18(14):1-45, 2017.

[21] Daniel Levy, Matt D. Hoffman, and Jascha Sohl-Dickstein. Generalizing hamiltonian monte carlo with neural networks. In International Conference on Learning Representations, 2018.

[22] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087-1092, 1953.

[23] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.

[24] Radford M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 54:113-162, 2010.

[25] Kirill Neklyudov, Pavel Shvechikov, and Dmitry Vetrov. Metropolis-hastings view on variational inference and adversarial training. arXiv preprint arXiv:1810.07151, 2018.

[26] J. W. Paisley, D. M. Blei, and M. I. Jordan. Variational Bayesian inference with stochastic search. In International Conference on Machine Learning, 2012.

[27] Cristian Pasarica and Andrew Gelman. Adaptively scaling the metropolis algorithm using expected squared jumped distance. Statistica Sinica, pages 343-364, 2010.

[28] R. Ranganath, S. Gerrish, and D. M. Blei. Black box variational inference. In Artificial Intelligence and Statistics, 2014.
[29] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.

[30] Gareth O Roberts, Andrew Gelman, and Walter R Gilks. Weak convergence and optimal scaling of random walk metropolis algorithms. The Annals of Applied Probability, 7(1):110-120, 1997.

[31] Gareth O Roberts and Jeffrey S Rosenthal. Optimal scaling of discrete approximations to langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1):255-268, 1998.

[32] Gareth O Roberts and Jeffrey S Rosenthal. Optimal scaling for various metropolis-hastings algorithms. Statistical Science, pages 351-367, 2001.

[33] Gareth O Roberts and Jeffrey S Rosenthal. Coupling and ergodicity of adaptive markov chain monte carlo algorithms. Journal of Applied Probability, 44(2):458-475, 2007.

[34] Gareth O Roberts and Jeffrey S Rosenthal. Examples of adaptive mcmc. Journal of Computational and Graphical Statistics, 18(2):349-367, 2009.

[35] Jeffrey S Rosenthal. Optimal proposal distributions and adaptive mcmc. In Handbook of Markov Chain Monte Carlo, pages 114-132. Chapman and Hall/CRC, 2011.

[36] T. Salimans, D. P. Kingma, and M. Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, 2015.

[37] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, 2015.

[38] Jiaming Song, Shengjia Zhao, and Stefano Ermon. A-nice-mc: Adversarial training for mcmc. In Advances in Neural Information Processing Systems, pages 5140-5150, 2017.

[39] T. Tieleman and G. Hinton. Lecture 6.5-RMSPROP: Divide the gradient by a running average of its recent magnitude. Coursera: Neural Networks for Machine Learning, 4, 2012.

[40] M. K. Titsias and M. Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning, 2014.