# kernel_adaptive_metropolishastings__f5e0561d.pdf Kernel Adaptive Metropolis-Hastings Dino Sejdinovic DINO@GATSBY.UCL.AC.UK Heiko Strathmann UCABHST@GATSBY.UCL.AC.UK Maria Lomeli Garcia MLOMELI@GATSBY.UCL.AC.UK Christophe Andrieu C.ANDRIEU@BRISTOL.AC.UK Arthur Gretton ARTHUR.GRETTON@GMAIL.COM Gatsby Unit, CSML, University College London, UK and School of Mathematics, University of Bristol, UK A Kernel Adaptive Metropolis-Hastings algorithm is introduced, for the purpose of sampling from a target distribution with strongly nonlinear support. The algorithm embeds the trajectory of the Markov chain into a reproducing kernel Hilbert space (RKHS), such that the feature space covariance of the samples informs the choice of proposal. The procedure is computationally efficient and straightforward to implement, since the RKHS moves can be integrated out analytically: our proposal distribution in the original space is a normal distribution whose mean and covariance depend on where the current sample lies in the support of the target distribution, and adapts to its local covariance structure. Furthermore, the procedure requires neither gradients nor any other higher order information about the target, making it particularly attractive for contexts such as Pseudo Marginal MCMC. Kernel Adaptive Metropolis Hastings outperforms competing fixed and adaptive samplers on multivariate, highly nonlinear target distributions, arising in both real-world and synthetic examples. 1. Introduction The choice of the proposal distribution is known to be crucial for the design of Metropolis-Hastings algorithms, and methods for adapting the proposal distribution to increase the sampler s efficiency based on the history of the Markov chain have been widely studied. These methods often aim to learn the covariance structure of the target distribution, and adapt the proposal accordingly. Adaptive MCMC samplers were first studied by Haario et al. (1999; Proceedings of the 31 st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s). 2001), where the authors propose to update the proposal distribution along the sampling process. Based on the chain history, they estimate the covariance of the target distribution and construct a Gaussian proposal centered at the current chain state, with a particular choice of the scaling factor from Gelman et al. (1996). More sophisticated schemes are presented by Andrieu & Thoms (2008), e.g., adaptive scaling, component-wise scaling, and principal component updates. While these strategies are beneficial for distributions that show high anisotropy (e.g., by ensuring the proposal uses the right scaling in all principal directions), they may still suffer from low acceptance probability and slow mixing when the target distributions are strongly nonlinear, and the directions of large variance depend on the current location of the sampler in the support. In the present work, we develop an adaptive Metropolis-Hastings algorithm in which samples are mapped to a reproducing kernel Hilbert space, and the proposal distribution is chosen according to the covariance in this feature space (Sch olkopf et al., 1998; Smola et al., 2001). Unlike earlier adaptive approaches, the resulting proposal distributions are locally adaptive in input space, and oriented towards nearby regions of high density, rather than simply matching the global covariance structure of the distribution. Our approach combines a move in the feature space with a stochastic step towards the nearest input space point, where the feature space move can be analytically integrated out. Thus, the implementation of the procedure is straightforward: the proposal is simply a multivariate Gaussian in the input space, with location-dependent covariance which is informed by the feature space representation of the target. Furthermore, the resulting Metropolis-Hastings sampler only requires the ability to evaluate the unnormalized density of the target (or its unbiased estimate, as in Pseudo-Marginal MCMC of Andrieu & Roberts, 2009), and no gradient evaluation is needed, making it applicable to situations where more sophisticated schemes based on Hamiltonian Monte Carlo (HMC) or Metropolis Adjusted Langevin Algorithms Kernel Adaptive Metropolis Hastings (MALA) (Roberts & Stramer, 2003; Girolami & Calderhead, 2011) cannot be applied. We begin our presentation in Section 2, with a brief overview of existing adaptive Metropolis approaches; we also review covariance operators in the RKHS. Based on these operators, we describe a sampling strategy for Gaussian measures in the RKHS in Section 3, and introduce a cost function for constructing proposal distributions. In Section 4, we outline our main algorithm, termed Kernel Adaptive Metropolis-Hastings (MCMC Kameleon). We provide experimental comparisons with other fixed and adaptive samplers in Section 5, where we show superior performance in the context of Pseudo-Marginal MCMC for Bayesian classification, and on synthetic target distributions with highly nonlinear shape. 2. Background Adaptive Metropolis Algorithms. Let X = Rd be the domain of interest, and denote the unnormalized target density on X by π. Additionally, let Σt = Σt(x0, x1, . . . , xt 1) denote an estimate of the covariance matrix of the target density based on the chain history {xi}t 1 i=0. The original adaptive Metropolis at the current state of the chain state xt = y uses the proposal qt( |y) = N(y, ν2Σt), (1) where ν = 2.38/ d is a fixed scaling factor from Gelman et al. (1996). This choice of scaling factor was shown to be optimal (in terms of efficiency measures) for the usual Metropolis algorithm. While this optimality result does not hold for Adaptive Metropolis, it can nevertheless be used as a heuristic. Alternatively, the scale ν can also be adapted at each step as in Andrieu & Thoms (2008, Algorithm 4) to obtain the acceptance rate from Gelman et al. (1996), a = 0.234. RKHS Embeddings and Covariance Operators. According to the Moore-Aronszajn theorem (Berlinet & Thomas-Agnan, 2004, p. 19), for every symmetric, positive definite function (kernel) k : X X R, there is an associated reproducing kernel Hilbert space Hk of realvalued functions on X with reproducing kernel k. The map ϕ : X Hk, ϕ : x 7 k( , x) is called the canonical feature map of k. This feature map or embedding of a single point can be extended to that of a probability measure P on X: its kernel embedding is an element µP Hk, given by µP = k( , x) d P(x) (Berlinet & Thomas-Agnan, 2004; Fukumizu et al., 2004; Smola et al., 2007). If a measurable kernel k is bounded, it is straightforward to show using the Riesz representation theorem that the mean embedding µk(P) exists for all probability measures on X. For many interesting bounded kernels k, including the Gaussian, Laplacian and inverse multi-quadratics, the kernel embedding P 7 µP is injective. Such kernels are said to be characteristic (Sriperumbudur et al., 2010; 2011), since each distribution is uniquely characterized by its embedding (in the same way that every probability distribution has a unique characteristic function). The kernel embedding µP is the representer of expectations of smooth functions w.r.t. P, i.e., f Hk, f, µP Hk = f(x)d P(x). Given samples z = {zi}n i=1 P, the embedding of the empirical measure is µz = 1 n Pn i=1 k( , zi). Next, the covariance operator CP : Hk Hk for a probability measure P is given by CP = k( , x) k( , x) d P(x) µP µP (Baker, 1973; Fukumizu et al., 2004), where for a, b, c Hk the tensor product is defined as (a b)c = b, c Hk a. The covariance operator has the property that f, g Hk, f, CP g Hk = EP (fg) EP f EP g. Our approach is based on the idea that the nonlinear support of a target density may be learned using Kernel Principal Component Analysis (Kernel PCA) (Sch olkopf et al., 1998; Smola et al., 2001), this being linear PCA on the empirical covariance operator in the RKHS, Cz = 1 n Pn i=1 k( , zi) k( , zi) µz µz, computed on the sample z defined above. The empirical covariance operator behaves as expected: applying the tensor product definition gives f, Czg Hk = 1 n Pn i=1 f(zi)g(zi) 1 n Pn i=1 f(zi) 1 n Pn i=1 g(zi) . By analogy with algorithms which use linear PCA directions to inform M-H proposals (Andrieu & Thoms, 2008, Algorithm 8), nonlinear PCA directions can be encoded in the proposal construction, as described in Appendix C. Alternatively, one can focus on a Gaussian measure on the RKHS determined by the empirical covariance operator Cz rather than extracting its eigendirections, which is the approach we pursue in this contribution. This generalizes the proposal (1), which considers the Gaussian measure induced by the empirical covariance matrix on the original space. 3. Sampling in RKHS We next describe the proposal distribution at iteration t of the MCMC chain. We will assume that a subset of the chain history, denoted z = {zi}n i=1, n t 1, is available. Our proposal is constructed by first considering the samples in the RKHS associated to the empirical covariance operator, and then performing a gradient descent step on a cost function associated with those samples. Gaussian Measure of the Covariance Operator. We will work with the Gaussian measure on the RKHS Hk with mean k( , y) and covariance ν2Cz, where z = {zi}n i=1 is the subset of the chain history. While there is no analogue of a Lebesgue measure in an infinite dimensional RKHS, it is instructive (albeit with some abuse of notation) to denote Kernel Adaptive Metropolis Hastings this measure in the density form N(f ; k( , y), ν2Cz) 2ν2 f k( , y), C 1 z (f k( , y)) . As Cz is a finite-rank operator, this measure is supported only on a finite-dimensional affine space k( , y) + Hz, where Hz = span {k( , zi)}n i=1 is the subspace spanned by the canonical features of z. It can be shown that a sample from this measure has the form f = k( , y) + Pn i=1 βi [k( , zi) µz] , where β N(0, ν2 n I) is isotropic. Indeed, to see that f has the correct covariance structure, note that: E [(f k( , y)) (f k( , y))] j=1 βiβj (k( , zi) µz) (k( , zj) µz) i=1 (k( , zi) µz) (k( , zi) µz) = ν2Cz. Due to the equivalence in the RKHS between a Gaussian measure and a Gaussian Process (GP) (Berlinet & Thomas Agnan, 2004, Ch. 4), we can think of the RKHS samples f as trajectories of the GP with mean m(x) = k(x, y) and covariance function κ(x, x ) = cov [f(x), f(x )] i=1 (k(x, zi) µz(x)) (k(x , zi) µz(x )) . The covariance function κ of this GP is therefore the kernel k convolved with itself with respect to the empirical measure associated to the samples z, and draws from this GP therefore lie in a smaller RKHS; see Saitoh (1997, p. 21) for details. Obtaining Target Samples through Gradient Descent. We have seen how to obtain the RKHS sample f = k( , y) + Pn i=1 βi [k( , zi) µz] from the Gaussian measure in the RKHS. This sample does not in general have a corresponding pre-image in the original domain X = Rd; i.e., there is no point x X such that f = k( , x ). If there were such a point, then we could use it as a proposal in the original domain. Therefore, we are ideally looking for a point x X whose canonical feature map k( , x ) is close to f in the RKHS norm. We consider the optimization problem arg min x X k ( , x) f 2 Hk = arg min x X k(x, x) 2k(x, y) 2 i=1 βi [k(x, zi) µz(x)] In general, this is a non-convex minimization problem, and may be difficult to solve (Bakir et al., 2003). Rather than Samples {zi}200 i=1 Current position y Figure 1. Heatmaps (white denotes large) and gradients of g(x) for two samples of β and fixed z. solving it for every new vector of coefficients β, which would lead to an excessive computational burden for every proposal made, we simply make a single descent step along the gradient of the cost function, g(x) = k(x, x) 2k(x, y) 2 i=1 βi [k(x, zi) µz(x)] , (2) i.e., the proposed new point is x = y η xg(x)|x=y + ξ, where η is a gradient step size parameter and ξ N(0, γ2I) is an additional isotropic exploration term after the gradient step. It will be useful to split the scaled gradient at y into two terms as η xg(x)|x=y = η (ay Mz,y Hβ), where ay = xk(x, x)|x=y 2 xk(x, y)|x=y, Mz,y = 2 [ xk(x, z1)|x=y, . . . , xk(x, zn)|x=y] (3) is a d n matrix, and H = I 1 n1n n is the n n centering matrix. Figure 1 plots g(x) and its gradients for several samples of β-coefficients, in the case where the underlying z-samples are from the two-dimensional nonlinear Banana target distribution of Haario et al. (1999). It can be seen that g may have multiple local minima, and that it varies most along the high-density regions of the Banana distribution. 4. MCMC Kameleon Algorithm 4.1. Proposal Distribution We now have a recipe to construct a proposal that is able to adapt to the local covariance structure for the current chain Kernel Adaptive Metropolis Hastings MCMC Kameleon Input: unnormalized target π, subsample size n, scaling parameters ν, γ, adaptation probabilities {pt} t=0, kernel k, At iteration t + 1, 1. With probability pt, update a random subsample z = {zi}min(n,t) i=1 of the chain history {xi}t 1 i=0, 2. Sample proposed point x from qz( |xt) = N(xt, γ2I + ν2Mz,xt HM z,xt), where Mz,xtis given in Eq. (3) and H = I 1 n1n n is the centering matrix, 3. Accept/Reject with the Metropolis-Hastings acceptance probability A(xt, x ) in Eq. (4), ( x , w.p. A(xt, x ), xt, w.p. 1 A(xt, x ). state y. This proposal depends on a subset of the chain history z, and is denoted by qz( |y). While we will later simplify this proposal by integrating out the moves in the RKHS, it is instructive to think of the proposal generating process as: 1. Sample β N(0, ν2I) (n 1 normal of RKHS coefficients). This represents an RKHS sample f = k( , y) + Pn i=1 βi [k( , zi) µz] which is the goal of the cost function g(x). 2. Move along the gradient of g: x = y η xg(x)|x=y + ξ. This gives a proposal x |y, β N(y ηay + ηMz,y Hβ, γ2I) (d 1 normal in the original space). Our first step in the derivation of the explicit proposal density is to show that as long as k is a differentiable positive definite kernel, the term ay vanishes. Proposition 1. Let k be a differentiable positive definite kernel. Then ay = xk(x, x)|x=y 2 xk(x, y)|x=y = 0. Since ay = 0, the gradient step size η always appears together with β, so we merge η and the scale ν of the βcoefficients into a single scale parameter, and set η = 1 henceforth. Furthermore, since both p(β) and pz(x |y, β) are multivariate Gaussian densities, the proposal density qz(x |y) = p(β)pz(x |y, β)dβ can be computed analytically. We therefore get the following closed form expression for the proposal distribution. Proposition 2. qz( |y) = N(y, γ2I + ν2Mz,y HM z,y). Figure 2. 95% contours (red) of proposal distributions evaluated at a number of points, for the first two dimensions of the banana target of Haario et al. (1999). Underneath is the density heatmap, and the samples (blue) used to construct the proposals. Proofs of the above Propositions are given in Appendix A. With the derived proposal distribution, we proceed with the standard Metropolis-Hastings accept/reject scheme, where the proposed sample x is accepted with probability A(xt, x ) = min 1, π(x )qz(xt|x ) π(xt)qz(x |xt) giving rise to the MCMC Kameleon Algorithm. Note that each π(x ) and π(xt) could be replaced by their unbiased estimates without impacting the invariant distribution (Andrieu & Roberts, 2009). The constructed family of proposals encodes local structure of the target distribution, which is learned based on the subsample z. Figure 2 depicts the regions that contain 95% of the mass of the proposal distribution qz( |y) at various states y for a fixed subsample z, where the Banana target is used (details in Section 5). More examples of proposal contours can be found in Appendix B. 4.2. Properties of the Algorithm The update schedule and convergence. MCMC Kameleon requires a subsample z = {zi}n i=1 at each iteration of the algorithm, and the proposal distribution qz( |y) is updated each time a new subsample z is obtained. It is well known that a chain which keeps adapting the proposal distribution need not converge to the correct target (Andrieu & Thoms, 2008). To guarantee convergence, we introduce adaptation probabilities {pt} t=0, such that pt 0 and P t=1 pt = , and at iteration t we update the subsample z with probability pt. As adaptations occur with decreasing probability, Theorem 1 of Roberts & Rosenthal (2007) implies that the resulting algorithm is ergodic and converges to the correct target. Another straightforward way to guarantee convergence is to fix the set z = {zi}n i=1 after a burn-in phase; i.e., to stop adapting Roberts & Rosenthal (2007, Proposition 2). In this case, a burn-in phase is used to get a rough sketch of the shape of the distribution: the initial samples need not Kernel Adaptive Metropolis Hastings come from a converged or even valid MCMC chain, and it suffices to have a scheme with good exploratory properties, e.g., Welling & Teh (2011). In MCMC Kameleon, the term γ allows exploration in the initial iterations of the chain (while the subsample z is still not informative about the structure of the target) and provides regularization of the proposal covariance in cases where it might become ill-conditioned. Intuitively, a good approach to setting γ is to slowly decrease it with each adaptation, such that the learned covariance progressively dominates the proposal. Symmetry of the proposal. In Haario et al. (2001), the proposal distribution is asymptotically symmetric due to the vanishing adaptation property. Therefore, the authors compute the standard Metropolis acceptance probability. In our case, the proposal distribution is a Gaussian with mean at the current state of the chain xt = y and covariance γ2I + ν2Mz,y HM z,y, where Mz,y depends both on the current state y and a random subsample z = {zi}n i=1 of the chain history {xi}t 1 i=0. This proposal distribution is never symmetric (as covariance of the proposal always depends on the current state of the chain), and therefore we use the Metropolis-Hastings acceptance probability to reflect this. Relationship to MALA and Manifold MALA. The Metropolis Adjusted Langevin Algorithm (MALA) algorithm uses information about the gradient of the log-target density at the current chain state to construct a proposed point for the Metropolis step. Our approach does not require that the log-target density gradient be available or computable. Kernel gradients in the matrix Mz,y are easily obtained for commonly used kernels, including the Gaussian kernel (see section 4.3), for which the computational complexity is equal to evaluating the kernel itself. Moreover, while standard MALA simply shifts the mean of the proposal distribution along the gradient and then adds an isotropic exploration term, our proposal is centered at the current state, and it is the covariance structure of the proposal distribution that coerces the proposed points to belong to the high-density regions of the target. It would be straightforward to modify our approach to include a drift term along the gradient of the log-density, should such information be available, but it is unclear whether this would provide additional performance gains. Further work is required to elucidate possible connections between our approach and the use of a preconditioning matrix (Roberts & Stramer, 2003) in the MALA proposal; i.e., where the exploration term is scaled with appropriate metric tensor information, as in Riemannian manifold MALA (Girolami & Calderhead, 2011). 4.3. Examples of Covariance Structure for Standard Kernels The proposal distributions in MCMC Kameleon are dependant on the choice of the kernel k. To gain intuition re- garding their covariance structure, we give two examples below. Linear kernel. In the case of a linear kernel k(x, x ) = x x , we obtain Mz,y = 2 xx z1|x=y, . . . , xx zn|x=y = 2Z , so the proposal is given by qz( |y) = N(y, γ2I + 4ν2Z HZ); thus, the proposal simply uses the scaled empirical covariance Z HZ just like standard Adaptive Metropolis (Haario et al., 1999), with an additional isotropic exploration component, and depends on y only through the mean. Gaussian kernel. In the case of a Gaussian kernel k(x, x ) = exp x x 2 , since xk(x, x ) = 1 σ2 k(x, x )(x x), we obtain σ2 [k(y, z1)(z1 y), . . . , k(y, zn)(zn y)] . Consider how this encodes the covariance structure of the target distribution: Rij = γ2δij a=1 [k(y, za)]2 (za,i yi)(za,j yj) a =b k(y, za)k(y, zb)(za,i yi)(zb,j yj). (5) As the first two terms dominate, the previous points za which are close to the current state y (for which k(y, za) is large) have larger weights, and thus they have more influence in determining the covariance of the proposal at y. Mat ern kernel. In the Mat ern family of kernels kϑ,ρ(x, x ) = 21 ϑ Kϑ is the modified Bessel function of the second kind, we obtain a form of the covariance structure very similar to that of the Gaussain kernel. In this case, xkϑ,ρ(x, x ) = 1 2ρ2(ϑ 1)kϑ 1,ρ(x, x )(x x), so the only difference (apart from the scalings) to (5) is that the weights are now determined by a rougher kernel kϑ 1,ρ of the same family. 5. Experiments In the experiments, we compare the following samplers: (SM) Standard Metropolis with the isotropic proposal q( |y) = N(y, ν2I) and scaling ν = 2.38/ d, (AMFS) Adaptive Metropolis with a learned covariance matrix and fixed scaling ν = 2.38/ d, (AM-LS) Adaptive Metropolis with a learned covariance matrix and scaling learned to bring the acceptance rate close to α = 0.234 as described in Andrieu & Thoms (2008, Algorithm 4), and (KAMH-LS) MCMC Kameleon with the scaling ν learned Kernel Adaptive Metropolis Hastings 6 5 4 3 2 1 0 θ2 Figure 3. Dimensions 2 and 7 of the marginal hyperparameter posterior on the UCI Glass dataset in the same fashion (γ was fixed to 0.2), and which also stops adapting the proposal after the burn-in of the chain (in all experiments, we use a random subsample z of size n = 1000, and a Gaussian kernel with bandwidth selected according to the median heuristic). We consider the following nonlinear targets: (1) the posterior distribution of Gaussian Process (GP) classification hyperparameters (Filippone & Girolami, 2014) on the UCI glass dataset, and (2) the synthetic banana-shaped distribution of Haario et al. (1999) and a flower-shaped disribution concentrated on a circle with a periodic perturbation. 5.1. Pseudo-Marginal MCMC for GP Classification In the first experiment, we illustrate usefulness of the MCMC Kameleon sampler in the context of Bayesian classification with GPs (Williams & Barber, 1998). Consider the joint distribution of latent variables f, labels y (with covariate matrix X), and hyperparameters θ, given by p(f, y, θ) = p(θ)p(f|θ)p(y|f), where f|θ N(0, Kθ), with Kθ modeling the covariance between latent variables evaluated at the input covariates: (Kθ)ij = κ(xi, x j|θ) = exp 1 2 PD d=1 (xi,d x j,d)2 and θd = log ℓ2 d. We restrict our attention to the binary logistic classifier; i.e., the likelihood is given by p(yi|fi) = 1 1 exp( yifi) where yi { 1, 1}. We pursue a fully Bayesian treatment, and estimate the posterior of the hyperparameters θ. As observed by Murray & Adams (2012), a Gibbs sampler on p(θ, f|y), which samples from p(f|θ, y) and p(θ|f, y) in turn, is problematic, as p(θ|f, y) is extremely sharp, drastically limiting the amount that any Markov chain can update θ|f, y. On the other hand, if we directly consider the marginal posterior p(θ|y) p(y|θ)p(θ) of the hyperparameters, a much less peaked distribution can be obtained. However, the marginal likelihood p(y|θ) is intractable for non-Gaussian likelihoods p(y|f), so it is not possible to analytically integrate out the latent variables. Recently developed pseudomarginal MCMC methods (Andrieu & Roberts, 2009) en- able exact inference on this problem (Filippone & Girolami, 2014), by replacing p(y|θ) with an unbiased estimate ˆp(y|θ) := 1 nimp i=1 p(y|f (i))p(f (i)|θ) q(f (i)|θ), (6) where f (i) nimp i=1 q(f|θ) are nimp importance samples. In Filippone & Girolami (2014), the importance distribution q(f|θ) is chosen as the Laplacian or as the Expectation Propagation (EP) approximation of p(f|y, θ) p(y|f)p(f|θ), leading to state-of-the-art results. We consider the UCI Glass dataset (Bache & Lichman, 2013), where classification of window against non-window glass is sought. Due to the heterogeneous structure of each of the classes (i.e., non-window glass consists of containers, tableware and headlamps), there is no single consistent set of lengthscales determining the decision boundary, so one expects the posterior of the covariance bandwidths θd to have a complicated (nonlinear) shape. This is illustrated by the plot of the posterior projections to the dimensions 2 and 7 (out of 9) in Figure 3. Since the ground truth for the hyperparameter posterior is not available, we initially ran 30 Standard Metropolis chains for 500,000 iterations (with a 100,000 burn-in), kept every 1000-th sample in each of the chains, and combined them. The resulting samples were used as a benchmark, to evaluate the performance of shorter single-chain runs of SM, AM-FS, AM-LS and KAMH-LS. Each of these algorithms was run for 100,000 iterations (with a 20,000 burnin) and every 20th sample was kept. Two metrics were used in evaluating the performance of the four samplers, relative to the largescale benchmark. First, the distance ˆµθ µb θ 2 was computed between the mean ˆµθ estimated from each of the four sampler outputs, and the mean µb θ on the benchmark sample (Fig. 4, left), as a function of sample size. Second, the MMD (Borgwardt et al., 2006; Gretton et al., 2007) was computed between each sampler output and the benchmark sample, using the polynomial kernel (1 + θ, θ )3; i.e., the comparison was made in terms of all mixed moments of order up to 3 (Fig. 4, right). The figures indicate that KAMH-LS approximates the benchmark sample better than the competing approaches, where the effect is especially pronounced in the high order moments, indicating that KAMH-LS thoroughly explores the distribution support in a relatively small number of samples. We emphasise that, as for any pseudo-marginal MCMC scheme, neither the likelihood itself, nor any higher-order information about the marginal posterior target p(θ|y), are available. This makes HMC or MALA based approaches such as (Roberts & Stramer, 2003; Girolami & Calderhead, 2011) unsuitable for this problem, so it is very difficult to deal with strongly nonlinear posterior targets. In contrast, as indicated in this example, the MCMC Kameleon scheme is able to effectively sample from such nonlinear targets, Kernel Adaptive Metropolis Hastings 1000 2000 3000 4000 5000 number of samples mean distance ˆµθ µb θ 2 KAMH-LS AM-LS AM-FS SM 1000 2000 3000 4000 5000 number of samples MMD from the benchmark sample KAMH-LS AM-LS AM-FS SM Figure 4. The comparison of SM, AM-FS, AM-LS and KAMH-LS in terms of the distance between the estimated mean and the mean on the benchmark sample (left) and in terms of the maximum mean discrepancy to the benchmark sample (right). The results are averaged over 30 chains for each sampler. Error bars represent 80%-confidence intervals. Moderately twisted 8-dimensional B(0.03, 100) target; iterations: 40000, burn-in: 20000 Accept [0, 1] ||ˆE[X]||/10 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.0 0.1 0.2 0.3 0.4 0.5 SM AM-FS AM-LS KAMH-LS Strongly twisted 8-dimensional B(0.1, 100) target; iterations: 80000, burn-in: 40000 Accept [0, 1] ||ˆE[X]||/10 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.0 SM AM-FS AM-LS KAMH-LS 8-dimensional F(10, 6, 6, 1) target; iterations: 120000, burn-in: 60000 Accept [0, 1] ||ˆE[X]||/10 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 SM AM-FS AM-LS KAMH-LS Figure 5. Results for three nonlinear targets, averaged over 20 chains for each sampler. Accept is the acceptance rate scaled to the interval [0, 1]. The norm of the mean ||ˆE[X]|| is scaled by 1/10 to fit into the figure scalling, and the bars over the 0.1, . . . , 0.9-quantiles represent the deviation from the exact quantiles, scaled by 10; i.e., 0.1 corresponds to 1% deviation. Error bars represent 80%-confidence intervals. Kernel Adaptive Metropolis Hastings and outperforms the vanilla Metropolis methods, which are the only competing choices in the pseudo-marginal context. In addition, since the bulk of the cost for pseudo-marginal MCMC is in importance sampling in order to obtain the acceptance ratio, the additional cost imposed by KAMHLS is negligible. Indeed, we observed that there is an increase of only 2-3% in terms of effective computation time in comparison to all other samplers, for the chosen size of the chain history subsample (n = 1000). 5.2. Synthetic examples Banana target. In Haario et al. (1999), the following family of nonlinear target distributions is considered. Let X N(0, Σ) be a multivariate normal in d 2 dimensions, with Σ = diag(v, 1, . . . , 1), which undergoes the transformation X Y , where Y2 = X2 + b(X2 1 v), and Yi = Xi for i = 2. We will write Y B(b, v). It is clear that EY = 0, and that B(y; b, v) = N(y1; 0, v)N(y2; b(y2 1 v), 1) j=3 N(yj; 0, 1). Flower target. The second target distribution we consider is the d-dimensional flower target F(r0, A, ω, σ), with F(x; r0, A, ω, σ) = x2 1 + x2 2 r0 A cos (ωatan2 (x2, x1)) j=3 N(xj; 0, 1). This distribution concentrates around the r0-circle with a periodic perturbation (with amplitude A and frequency ω) in the first two dimensions. In these examples, exact quantile regions of the targets can be computed analytically, so we can directly assess performance without the need to estimate distribution distances on the basis of samples (i.e., by estimating MMD to the benchmark sample). We compute the following measures of performance (similarly as in Haario et al. (1999); Andrieu & Thoms (2008)) based on the chain after burn-in: average acceptance rate, norm of the empirical mean (the true mean is by construction zero for all targets), and the deviation of the empirical quantiles from the true quantiles. We consider 8-dimensional target distributions: the moderately twisted B(0.03, 100) banana target (Figure 5, top) and the strongly twisted B(0.1, 100) banana target (Figure 5, middle) and F(10, 6, 6, 1) flower target (Figure 5, bottom). The results show that MCMC Kameleon is superior to the competing samplers. Since the covariance of the proposal adapts to the local structure of the target at the current chain state, as illustrated in Figure 2, MCMC Kameleon does not suffer from wrongly scaled proposal distributions. The result is a significantly improved quantile performance in comparison to all competing samplers, as well as a comparable or superior norm of the empirical mean. SM has a significantly larger norm of the empirical mean, due to its purely random walk behavior (e.g., the chain tends to get stuck in one part of the space, and is not able to traverse both tails of the banana target equally well). AM with fixed scale has a low acceptance rate (indicating that the scaling of the proposal is too large), and even though the norm of the empirical mean is much closer to the true value, quantile performance of the chain is poor. Even if the estimated covariance matrix closely resembles the true global covariance matrix of the target, using it to construct proposal distributions at every state of the chain may not be the best choice. For example, AM correctly captures scalings along individual dimensions for the flower target (the norm of its empirical mean is close to its true value of zero) but fails to capture local dependence structure. The flower target, due to its symmetry, has an isotropic covariance in the first two dimensions even though they are highly dependent. This leads to a mismatch in the scale of the covariance and the scale of the target, which concentrates on a thin band in the joint space. AM-LS has the correct acceptance rate, but the quantile performance is even worse, as the scaling now becomes too small to traverse high-density regions of the target. 6. Conclusions We have constructed a simple, versatile, adaptive, gradientfree MCMC sampler that constructs a family of proposal distributions based on the sample history of the chain. These proposal distributions automatically conform to the local covariance structure of the target distribution at the current chain state. In experiments, the sampler outperforms existing approaches on nonlinear target distributions, both by exploring the entire support of these distributions, and by returning accurate empirical quantiles, indicating faster mixing. Possible extensions include incorporating additional parametric information about the target densities, and exploring the tradeoff between the degree of subsampling of the chain history and convergence of the sampler. Software. Python implementation of MCMC Kameleon is available at https://github.com/karlnapf/ kameleon-mcmc. Acknowledgments. D.S., H.S., M.L.G. and A.G. acknowledge support of the Gatsby Charitable Foundation. We thank Mark Girolami for insightful discussions and the anonymous reviewers for useful comments. Kernel Adaptive Metropolis Hastings Andrieu, C. and Roberts, G.O. The pseudo-marginal approach for efficient Monte Carlo computations. Ann. Statist., 37(2): 697 725, 2009. Andrieu, C. and Thoms, J. A tutorial on adaptive MCMC. Statistics and Computing, 18(4):343 373, 2008. Bache, K. and Lichman, M. UCI Machine Learning Repository, 2013. URL http://archive.ics.uci.edu/ml. Baker, C. Joint measures and cross-covariance operators. Transactions of the American Mathematical Society, 186:273 289, 1973. Bakir, G., Weston, J., and Sch olkopf, B. Learning to find preimages. In Advances in Neural Information Processing Systems 16. MIT Press, 2003. Berlinet, A. and Thomas-Agnan, C. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer, 2004. Borgwardt, K. M., Gretton, A., Rasch, M. J., Kriegel, H.-P., Sch olkopf, B., and Smola, A. J. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics (ISMB), 22(14):e49 e57, 2006. Filippone, M. and Girolami, M. Pseudo-marginal Bayesian inference for Gaussian Processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014. doi: TPAMI.2014. 2316530. Fukumizu, K., Bach, F. R., and Jordan, M. I. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. J. Mach. Learn. Res., 5:73 99, 2004. Gelman, A., Roberts, G. O., and Gilks, W. R. Efficient Metropolis jumping rules. In Bayesian statistics, 5 (Alicante, 1994), Oxford Sci. Publ., pp. 599 607. 1996. Girolami, M. and Calderhead, B. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B, 73(2):123 214, 2011. Gretton, A., Borgwardt, K., Rasch, M., Sch olkopf, B., and Smola, A. A kernel method for the two-sample problem. In Advances in Neural Information Processing Systems 19, pp. 513 520, 2007. Haario, H., Saksman, E., and Tamminen, J. Adaptive Proposal Distribution for Random Walk Metropolis Algorithm. Comput. Stat., 14(3):375 395, 1999. Haario, H., Saksman, E., and Tamminen, J. An adaptive Metropolis algorithm. Bernoulli, 7(2):223 242, 2001. Murray, I. and Adams, R.P. Slice sampling covariance hyperparameters of latent Gaussian models. In Advances in Neural Information Processing Systems 23. 2012. Roberts, G.O. and Rosenthal, J.S. Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms. J. Appl. Probab., 44(2):458 475, 03 2007. Roberts, G.O. and Stramer, O. Langevin diffusions and Metropolis-Hastings algorithms. Methodol. Comput. Appl. Probab., 4:337 358, 2003. Saitoh, S. Integral transforms, reproducing kernels, and their applications. Pitman Research Notes in Mathematics 369, Longman Scientific & Techn., 1997. Sch olkopf, B., Smola, A. J., and M uller, K.-R. Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput., 10:1299 1319, 1998. Smola, A., Gretton, A., Song, L., and Sch olkopf, B. A Hilbert space embedding for distributions. In Proceedings of the Conference on Algorithmic Learning Theory (ALT), pp. 13 31. Springer, 2007. Smola, A. J., Mika, S., Sch olkopf, B., and Williamson, R. C. Regularized principal manifolds. J. Mach. Learn. Res., 1:179 209, 2001. Sriperumbudur, B., Gretton, A., Fukumizu, K., Lanckriet, G., and Sch olkopf, B. Hilbert space embeddings and metrics on probability measures. J. Mach. Learn. Res., 11:1517 1561, 2010. Sriperumbudur, B., Fukumizu, K., and Lanckriet, G. Universality, characteristic kernels and RKHS embedding of measures. J. Mach. Learn. Res., 12:2389 2410, 2011. Steinwart, I. and Christmann, A. Support Vector Machines. Springer, 2008. Welling, M. and Teh, Y.W. Bayesian learning via stochastic gradient Langevin dynamics. In Proc. of the 28th International Conference on Machine Learning (ICML), pp. 681 688, 2011. Williams, C.K.I. and Barber, D. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342 1351, 1998.