# logconcave_sampling_metropolishastings_algorithms_are_fast__c254a64e.pdf

Journal of Machine Learning Research 20 (2019) 1-42 Submitted 4/19; Revised 11/19; Published 12/19

Log-concave sampling: Metropolis-Hastings algorithms are fast

Raaz Dwivedi , raaz.rsk@berkeley.edu Yuansi Chen , yuansi.chen@berkeley.edu Martin J. Wainwright , , wainwrig@berkeley.edu Bin Yu , binyu@berkeley.edu Department of Statistics

Department of Electrical Engineering and Computer Sciences

University of California, Berkeley Voleon Group , Berkeley

Editor: Suvrit Sra

We study the problem of sampling from a strongly log-concave density supported on Rd, and prove a non-asymptotic upper bound on the mixing time of the Metropolis-adjusted Langevin algorithm (MALA). The method draws samples by simulating a Markov chain obtained from the discretization of an appropriate Langevin diﬀusion, combined with an accept-reject step. Relative to known guarantees for the unadjusted Langevin algorithm (ULA), our bounds show that the use of an accept-reject step in MALA leads to an exponentially improved dependence on the error-tolerance. Concretely, in order to obtain samples with TV error at most δ for a density with condition number κ, we show that MALA requires O κd log(1/δ) steps from a warm start, as compared to the O κ2d/δ2

steps established in past work on ULA. We also demonstrate the gains of a modiﬁed version of MALA over ULA for weakly log-concave densities. Furthermore, we derive mixing time bounds for the Metropolized random walk (MRW) and obtain O(κ) mixing time slower than MALA. We provide numerical examples that support our theoretical ﬁndings, and demonstrate the beneﬁts of Metropolis-Hastings adjustment for Langevin-type sampling algorithms. Keywords: Log-concave sampling, Langevin algorithms, MCMC algorithms, conductance methods

1. Introduction

Drawing samples from a known distribution is a core computational challenge common to many disciplines, with applications in statistics, probability, operations research, and other areas involving stochastic models. In statistics, these methods are useful for both estimation and inference. Under the frequentist inference framework, samples drawn from a suitable distribution can form conﬁdence intervals for a point estimate, such as those obtained by maximum likelihood. Sampling procedures are also standard in the Bayesian setting, used for exploring posterior distributions, obtaining credible intervals, and solving

. *Raaz Dwivedi and Yuansi Chen contributed equally to this work.

c 2019 Dwivedi, Chen, Wainwright and Yu.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/19-306.html.

Dwivedi, Chen, Wainwright and Yu

inverse problems. Estimating the mean, posterior mean in a Bayesian setting, expectations of desired quantities, probabilities of rare events and volumes of particular sets are settings in which Monte Carlo estimates are commonly used. Recent decades have witnessed great success of Markov Chain Monte Carlo (MCMC) algorithms in generating random samples; for instance, see the handbook by Brooks et al. (2011) and references therein. In a broad sense, these methods are based on two steps. The ﬁrst step is to construct a Markov chain whose stationary distribution is either equal to the target distribution or close to it in a suitable metric. Given this chain, the second step is to draw samples by simulating the chain for a certain number of steps. Many algorithms have been proposed and studied for sampling from probability distributions with a density on a continuous state space. Two broad categories of these methods are zeroth-order methods and ﬁrst-order methods. On one hand, a zeroth-order method is based on querying the density of the distribution (up to a proportionality constant) at a point in each iteration. By contrast, a ﬁrst-order method makes use of additional gradient information about the density. A few popular examples of zeroth-order algorithms include Metropolized random walk (MRW) (Mengersen et al., 1996; Roberts and Tweedie, 1996b), Ball Walk (Lov asz and Simonovits, 1990; Dyer et al., 1991; Lov asz and Simonovits, 1993) and the Hit-and-run algorithm (B elisle et al., 1993; Kannan et al., 1995; Lov asz, 1999; Lov asz and Vempala, 2006, 2007). A number of ﬁrst-order methods are based on the Langevin diﬀusion. Algorithms related to the Langevin diﬀusion include the Metropolis adjusted Langevin Algorithm (MALA) (Roberts and Tweedie, 1996a; Roberts and Stramer, 2002; Bou-Rabee and Hairer, 2012), the unadjusted Langevin algorithm (ULA) (Parisi, 1981; Grenander and Miller, 1994; Roberts and Tweedie, 1996a; Dalalyan, 2016), underdamped (kinetic) Langevin MCMC (Cheng et al., 2018; Eberle et al., 2019), Riemannian MALA (Xifara et al., 2014), Proximal-MALA (Pereyra, 2016; Durmus et al., 2018), Metropolis adjusted Langevin truncated algorithm (Roberts and Tweedie, 1996a), Hamiltonian Monte Carlo (Neal, 2011) and Projected ULA (Bubeck et al., 2018). There is now a rich body of work on these methods, and we do not attempt to provide a comprehensive summary in this paper. More details can be found in the survey by Roberts et al. (2004), which covers MCMC algorithms for general distributions, and the survey by Vempala (2005), which focuses on random walks for compactly supported distributions. In this paper, we study sampling algorithms for sampling from a log-concave distribution Π equipped with a density. The density of log-concave distribution can be written in the form

π (x) = e f(x) Z

Rd e f(y)dy for all x Rd, (1)

where f is a convex function on Rd. Up to an additive constant, the function f corresponds to the log-likelihood deﬁned by the density. Standard examples of log-concave distributions include the normal distribution, exponential distribution and Laplace distribution. Some recent work has provided non-asymptotic bounds on the mixing times of Langevin type algorithms for sampling from a log-concave density. The mixing time corresponds to the number of steps, as A function of both the problem dimension d and the error tolerance δ, to obtain a sample from a distribution that is δ-close to the target distribution in total variation

Metropolis-Hastings algorithms are fast

distance or other distribution distances. It is known that both the ULA updates (see, e.g., Dalalyan, 2016; Durmus et al., 2019; Cheng and Bartlett, 2018) as well as underdamped Langevin MCMC (Cheng et al., 2018) have mixing times that scale polynomially in the dimension d, as well the inverse of the error tolerance 1/δ. Both the ULA and underdamped-Langevin MCMC methods are based on evaluations of the gradient f, along with the addition of Gaussian noise. Durmus et al. (2019) shows that for an appropriate decaying step size schedule, the ULA algorithm converges to the correct stationary distribution. However, their results, albeit non-asymptotic, are hard to quantify. In the sequel, we limit our discussion to Langevin algorithms based on constant step sizes, for which there are a number of explicit quantitative bounds on the mixing time. When one uses a ﬁxed step size for these algorithms, an important issue is that the resulting random walks are asymptotically biased: due to the lack of Metropolis-Hastings correction step, the algorithms will not converge to the stationary distribution if run for a large number of steps. Furthermore, if the step size is not chosen carefully the chains may become transient (Roberts and Tweedie, 1996a). Thus, the typical theory is based on running such a chain for a pre-speciﬁed number of steps, depending on the tolerance, dimension and other problem parameters. In contrast, the Metropolis-Hastings step that underlies the MALA algorithm ensures that the resulting random walk has the correct stationary distribution. Roberts and Tweedie (1996a) derived suﬃcient conditions for exponential convergence of the Langevin diﬀusion and its discretizations, with and without Metropolis-adjustment. However, they considered the distributions with f(x) = x α 2 and proved geometric convergence of ULA and MALA under some speciﬁc conditions. In a more general setting, Bou-Rabee and Hairer (2012) derived non-asymptotic mixing time bounds for MALA. However, all these bounds are non-explicit, and so makes it diﬃcult to extract an explicit dependency in terms of the dimension d and error tolerance δ. A precise characterization of this dependence is needed if one wants to make quantitative comparisons with other algorithms, including ULA and other Langevin-type schemes. Eberle (2014) derived mixing time bounds for MALA albeit in a more restricted setting compared to the one considered in this paper. In particular, Eberle s convergence guarantees are in terms of a modiﬁed Wasserstein distance, truncated so as to be upper bounded by a constant, for a subset of strongly concave measures which are four-times continuously diﬀerentiable and satisfy certain bounds on the derivatives up to order four. With this context, one of the main contributions of our paper is to provide an explicit upper bound on the mixing time bounds in total variation distance of the MALA algorithm for general log-concave distributions.

Our contributions: This paper contains two main results, both having to do with the upper bounds on mixing times of MCMC methods for sampling. As described above, our ﬁrst and primary contribution is an explicit analysis of the mixing time of Metropolis adjusted Langevin Algorithm (MALA). A second contribution is to use similar techniques to analyze a zeroth-order method called Metropolized random walk (MRW) and derive an explicit non-asymptotic mixing time bound for it. Unlike the ULA, these methods make use of the Metropolis-Hastings accept-reject step and consequently converge to the target distributions in the limit of inﬁnite steps. Here we provide explicit non-asymptotic mixing time bounds for MALA and MRW, thereby showing that MALA converges signiﬁcantly

Dwivedi, Chen, Wainwright and Yu

faster than ULA, at least in terms of the best known upper bounds on their respective mixing times.1 In particular, we show that if the density is strongly log-concave and smooth, the δ-mixing time for MALA scales as κd log(1/δ) which is signiﬁcantly faster than ULA s convergence rate of order κ2d/δ2. On the other hand, Moreover, we also show that MRW mixes O (κ) slower when compared to MALA. Furthermore, if the density is weakly logconcave, we show that (a modiﬁed version of) MALA converges in O d2/δ1.5 time in comparison to the O d3/δ4 mixing time for ULA. As alluded to earlier, such a speed-up for MALA is possible since we can choose a large step size for it which in turn is possible due to its unbiasedness in the limit of inﬁnite steps. In contrast, for ULA the step-size has to be small enough to control the bias of the distribution of the ULA iterates in the limit of inﬁnite steps, leading to a relative slow down when compared to MALA. The remainder of the paper is organized as follows. In Section 2, we provide background on a suite of MCMC sampling algorithms based on the Langevin diﬀusion. Section 3 is devoted to the statement of our mixing time bounds for MALA and MRW, along with a discussion of some consequences of these results. Section 4 is devoted to some numerical experiments to illustrate our guarantees. We provide the proofs of our main results in Section 5, with certain more technical arguments deferred to the appendices. We conclude with a discussion in Section 6.

Notation: For two sequences aϵ and bϵ indexed by a scalar ϵ I R, we say that aϵ = O (bϵ) if there exists a universal constant c > 0 such that aϵ cbϵ for all ϵ I. The Euclidean norm of a vector x Rd is denoted by x 2. The Euclidean ball with center x and radius r is denoted by B(x, r). For two distributions P1 and P2 deﬁned on the space (Rd, B(Rd)) where B(Rd) denotes the Borel-sigma algebra on Rd, we use P1 P2 TV to denote their total variation distance given by

P1 P2 TV = sup A B(Rd) |P1(A) P2(A)| .

Furthermore, KL(P1 P2) denotes their Kullback-Leibler (KL) divergence. We use Π to denote the target distribution with density π.

2. Background and problem set-up

In this section, we brieﬂy describe the general framework for MCMC algorithms and review the rates of convergence of existing random walks for log-concave distributions.

2.1. Markov chains and mixing

Here we consider the task of drawing samples from a target distribution Π with density π. A popular class of methods are based on setting up of an irreducible and aperiodic discrete-time Markov chain whose stationary distribution is equal to or close to the target distribution Π in certain metric, e.g., total variation (TV) norm. In order to obtain a δ-accurate sample, one simulates the Markov chain for a certain number of steps k, as determined by a mixing time analysis.

1. Throughout the paper, we make comparisons between sampling algorithms based on known upper bounds on respective mixing times; obtaining matching lower bounds is also of interest.

Metropolis-Hastings algorithms are fast

Going forward, we assume familiarity of the reader with a basic background in Markov chains. See, e.g., Chapters 9-12 in the standard reference textbook (Meyn and Tweedie, 2012) for a rigorous and detailed introduction. For a more rapid introduction to the basics of continuous state-space Markov chains, we refer the reader to the expository paper (Diaconis and Freedman, 1997) or Section 1 and 2 of the papers (Lov asz and Simonovits, 1993; Vempala, 2005). We now brieﬂy describe a certain class of Markov chains that are of Metropolis-Hastings type (Metropolis et al., 1953; Hastings, 1970). See the books (Robert, 2004; Brooks et al., 2011) and references therein for further background on these chains. Starting at a given initial density π0 over Rd, any such Markov chain is simulated in two steps: (1) a proposal step, and (2) an accept-reject step. For the proposal step, we make use of a proposal function p : Rd Rd R+, where p(x, ) is a density function for each x Rd. At each iteration, given a current state x Rd of the chain, the algorithm proposes a new vector z Rd by sampling from the proposal density p(x, ). In the second step, the algorithm accepts z Rd as the new state of the Markov chain with probability

α(x, z) := min 1, π(z)p(z, x)

π(x)p(x, z)

Otherwise, with probability equal to 1 α(x, z), the chain stays at x. Consequently, the overall transition kernel q for the Markov chain is deﬁned by the function

q(x, z) := p(x, z)α(x, z) for z = x,

and a probability mass at x with weight 1 R

X q(x, z)dz. The purpose of the Metropolis Hastings correction (2) is to ensure that the target density π is stationary for the Markov chain. Overall, this set-up deﬁnes an operator Tp on the space of probability distributions: given the distribution µk of the chain at time k, the distribution at time k + 1 is given by Tp(µk). In fact, with the starting distribution µ0, the distribution of the chain at kth step is given by T k p (µ0). Note that in this notation, the transition distribution at any state x is given by Tp(δx) where δx denotes the dirac-delta distribution at x. Our assumptions and set-up ensure that the chain converges to target distribution in the limit of inﬁnite steps, i.e., limk T k p (µ0) = Π. However, a more practical notion of convergence is how many steps of the chain suﬃce to ensure that the distribution of the chain is close to the target Π. In order to quantify the closeness, for a given tolerance parameter δ (0, 1) and initial distribution µ0, we deﬁne the δ-mixing time as

tmix(δ; µ0) := min n k | T k p (µ0) Π TV δ o , (3)

corresponding to the minimum number of steps that the chain takes to reach within δ in TV-norm of the target distribution, given that it starts with distribution µ0.

2.2. Sampling from log-concave distributions

Given the set-up in the previous subsection, we now describe several algorithms for sampling from log-concave distributions. Let Px denote the proposal distribution at x corresponding to the proposal density p(x, ). Possible choices of this proposal function include:

Dwivedi, Chen, Wainwright and Yu

Independence sampler: the proposal distribution does not depend on the current state of the chain, e.g., rejection sampling or when Px = N(0, Σ), where Σ is a hyperparameter.

Random walk: the proposal function satisﬁes p(x, y) = q(y x) for some probability density q, e.g., when Px = N(x, 2h Id) where h is a hyper-parameter.

Langevin algorithm: the proposal distribution is shaped according to the target distribution and is given by Px = N(x h f(x), 2h Id), where h is chosen suitably.

Symmetric Metropolis algorithms: the proposal function p satisﬁes p(x, y) = p(y, x). Some examples are Ball Walk (Frieze et al., 1994), and Hit-and-run (Lov asz, 1999).

Naturally, the convergence rate of these algorithms depends on the properties of the target density π, and the degree to which the proposal function p is suited for the task at hand. A key diﬀerence between Langevin algorithm and other algorithms is that the former makes use of ﬁrst-order (gradient) information about the target distribution Π. We now brieﬂy discuss the existing theoretical results about the convergence rate of diﬀerent MCMC algorithms. Several results on MCMC algorithms have focused on on establishing behavior and convergence of these sampling algorithms in an asymptotic or a non-explicit sense, e.g., geometric and uniform ergodicity, asymptotic variance, and central limit theorems. For more details, we refer the readers to the papers by Talay and Tubaro (1990); Meyn and Tweedie (1994); Roberts and Tweedie (1996b,a); Jarner and Hansen (2000); Roberts and Rosenthal (2001); Roberts and Stramer (2002); Pillai et al. (2012); Roberts and Rosenthal (2014), the survey by Roberts et al. (2004) and the references therein. Such results, albeit helpful for gaining insight, do not provide user-friendly rates of convergence. In other words, from these results, it is not easy to determine the computational complexity of various MCMC algorithms as a function of the problem dimension d and desired accuracy δ. Explicit nonasymptotic convergence bounds, which provide useful information for practice, are the focus of this work. We discuss the results of such type and the Langevin algorithm in more detail in Section 2.2.2. We begin with the Metropolized random walk.

2.2.1. Metropolized random walk

Roberts and Tweedie (1996b) established suﬃcient conditions on the proposal function p and the target distribution Π for the geometric convergence of several random walk Metropolis Hastings algorithms. In Section 3, we establish non-asymptotic convergence rate for the Metropolized random walk, which is based on Gaussian proposals. That is when the chain is at state xk, a proposal is drawn as follows

zk+1 = xk +

2h ξk+1, (4)

where the noise term ξk+1 N(0, Id) is independent of all past iterates. The chain then makes the transition according to an accept-reject step with respect to Π. Since the proposal distribution is symmetric, this step can be described as

zk+1 with probability min 1, π(zk+1)

xk otherwise.

Metropolis-Hastings algorithms are fast

This sampling algorithm is an instance of a zeroth-order method since it makes use of only the function values of the density π. We refer to this algorithm as MRW in the sequel. It is easy to see that the chain has a positive density of jumping from any state x to y in Rd

and hence is strongly Π-irreducible and aperiodic. Consequently, Theorem 1 by Diaconis and Freedman (1997) implies that the chain has a unique stationary distribution Π and converges to as the number of steps increases to inﬁnity. Note that this algorithm has also been referred to as Random walk Metropolized (RWM) and Random walk Metropolis Hastings (RWMH) in the literature.

2.2.2. Langevin diffusion and related sampling algorithms

Langevin-type algorithms are based on Langevin diﬀusion, a stochastic process whose evolution is characterized by the stochastic diﬀerential equation (SDE):

d Xt = f(Xt)dt +

2 d Wt, (5)

where {Wt | t 0} is the standard Brownian motion on Rd. Under fairly mild conditions on f, it is known that the diﬀusion (5) has a unique strong solution {Xt, t 0} that is a Markov process (Roberts and Tweedie, 1996a; Meyn and Tweedie, 2012). Furthermore, it can be shown that the distribution of Xt converges as t + to the invariant distribution Π characterized by the density π(x) exp( f(x)).

Unadjusted Langevin algorithm A natural way to simulate the Langevin diﬀusion (5) is to consider its forward Euler discretization, given by

xk+1 = xk h f(xk) +

2hξk+1, (6)

where the driving noise ξk+1 N(0, Id) is drawn independently at each time step. The use of iterates deﬁned by equation (6) can be traced back at least to Parisi (1981) for computing correlations; this use was noted by Besag in his commentary on the paper by Grenander and Miller (1994). However, even when the SDE is well behaved, the iterates deﬁned by this discretization have mixed behavior. For suﬃciently large step sizes h, the distribution of the iterates deﬁned by equation (6) converges to a stationary distribution that is no longer equal to Π. In fact, Roberts and Tweedie (1996a) showed that if the step size h is not chosen carefully, then the Markov chain deﬁned by equation (6) can become transient and have no stationary distribution. However, in a series of recent works (Dalalyan, 2016; Durmus et al., 2019; Cheng and Bartlett, 2018), it has been established that with a careful choice of step-size h and iteration count K, running the chain (6) for exactly K steps yields an iterate x K whose distribution is close to Π. This more recent body of work provides nonasymptotic bounds that explicitly quantify the rate of convergence for this chain. Note that the algorithm (6) does not belong to the class of Metropolis-Hastings algorithms since it does not involve an accept-reject step and does not have the target distribution Π as its stationary distribution. Consequently, in the literature, this algorithm is referred to as the unadjusted Langevin Algorithm, or ULA for short.

Metropolis adjusted Langevin algorithm An alternative approach to handling the discretization error is to adopt N(xk h f(xk), 2h Id) as the proposal distribution, and perform the Metropolis-Hastings accept-reject step. Doing so leads to the Metropolis-adjusted

Dwivedi, Chen, Wainwright and Yu

Langevin Algorithm, or MALA for short. We describe the diﬀerent steps of MALA in Algorithm 1. As mentioned earlier, the Metropolis-Hastings correction ensures that the distribution of the MALA iterates {xk} converges to the correct distribution Π as k . Indeed, since at each step the chain can reach any state x Rd, it is strongly Π-irreducible and thereby ergodic (Meyn and Tweedie, 2012; Diaconis and Freedman, 1997). Both MALA and ULA are instances of ﬁrst-order sampling methods since they make use of both the function and the gradient values of f at diﬀerent points. A natural question is if employing the accept-reject step for the discretization (6) provides any gain in the convergence rate. Our analysis to follow answers this question in the aﬃrmative.

Algorithm 1: Metropolis adjusted Langevin algorithm (MALA)

Input: Step size h and a sample x0 from a starting distribution µ0 Output: Sequence x1, x2, . . .

1 for i = 0, 1, . . . do

2 zi+1 N(xi h f(xi), 2h Id) % propose a new state

3 αi+1 = min

1, exp f(zi+1) xi zi+1 + h f(zi+1) 2 2 /4h

exp f(xi) zi+1 xi + h f(xi) 2 2 /4h

4 Ui+1 U[0, 1]

5 if Ui+1 αi+1 then xi+1 zi+1 % accept the proposal

6 else xi+1 xi % reject the proposal

2.3. Problem set-up

We study MALA and MRW and contrast their performance with existing algorithms for the case when the negative log density f(x) := log π(x) is smooth and strongly convex. A function f is said to be L-smooth if

f(y) f(x) f(x) (y x) L

2 x y 2 2 for all x, y Rd. (7a)

In the other direction, a convex function f is said to be m-strongly convex if 2

f(y) f(x) f(x) (y x) m

2 x y 2 2 for all x, y Rd. (7b)

The rates derived in this paper apply to log-concave3 distributions (1) such that f is continuously diﬀerentiable on Rd, and is both L-smooth and m-strongly convex. For such a function f, its condition number κ is deﬁned as κ := L/m. We also refer to κ as the condition number of the target distribution Π. We summarize the mixing time bounds of several sampling algorithms in Tables 1 and 2, as a function of the dimension d, the errortolerance δ, and the condition number κ. In Table 1, we state the results when the chain

2. See Appendix A for a statement of some well-known properties of smooth and strongly convex functions. 3. While our techniques can yield sharper guarantees under more idealized assumptions, e.g., like when f is Lipschitz (bounded gradients) and the target distribution satisﬁes an isoperimetry inequality, here we focus on deriving explicit guarantees with log-concave distributions.

Metropolis-Hastings algorithms are fast

Random walk Strongly log-concave Weakly log-concave

ULA (Cheng and Bartlett, 2018) O dκ2 log((log β)/δ)

ULA (Dalalyan, 2016) O dκ2 log2(β/δ)

MRW (this work) O dκ2 log β

MALA (this work) O max dκ, d0.5κ1.5 log β

Table 1. Scalings of upper bounds on δ-mixing time for diﬀerent random walks in Rd with target π e f. In the second column, we consider smooth and strongly log-concave densities, and report the bounds from a β-warm start for densities such that m Id 2f(x) LId for any x Rd and use κ := L/m to denote the condition number of the density. The big-O notation hides universal constants. We remark that the presented bounds for ULA in this column are not stated in the corresponding papers, and are derived by us, using their framework. In the last column, we summarize the scaling for weakly log-concave smooth densities: 0 2f(x) LId for all x Rd. For this case, the O notation is used to track scaling only with respect to d, δ and L and ignore dependence on the starting distribution and a few other parameters.

Random walk Distribution µ tmix(δ; µ )

ULA (Cheng and Bartlett, 2018) N(x , m 1Id) O dκ2 log(dκ/δ)

ULA (Dalalyan, 2016) N(x , L 1Id) O (d3 + d log2(1/δ))κ2

MRW (this work) N(x , L 1Id) O d2κ2 log1.5 κ

MALA (this work) N(x , L 1Id) O d2κ log κ

Table 2. Scalings of upper bounds on δ-mixing time, from the starting distribution µ given in column two, for diﬀerent random walks in Rd with target π e f such that m Id 2f(x) LId for any x Rd and κ := L/m. Here x denotes the unique mode of the target density π.

has a warm-start deﬁned below (refer to the deﬁnition (8)). Table 2 summarizes mixing time bounds from a particular distribution µ . Furthermore, in Section 3.3 we discuss the case when the f is smooth but not strongly convex and show that a suitable adaptation of MALA has a faster mixing rate compared to ULA for this case.

Dwivedi, Chen, Wainwright and Yu

3. Main results

We now state our main results for mixing time bounds for MALA and MRW. In our results, we use c, c to denote universal positive constants. Their values can change depending on the context, but do not depend on the problem parameters in all cases. In this section, we begin by discussing the case of strongly log-concave densities. We state results for MALA and MRW from a warm start in Section 3.1, and from certain feasible starting distributions in Section 3.2. Section 3.3 is devoted to the case of weakly log-concave densities.

3.1. Mixing time bounds for warm start

In the analysis of Markov chains, it is convenient to have a rough measure of the distance between the initial distribution µ0 and the stationary distribution. As in past work on the problem, we adopt the following notion of warmness: For a ﬁnite scalar β > 0, the initial distribution µ0 is said to be β-warm with respect to the stationary distribution Π if

where the supremum is taken over all measurable sets A. In parts of our work, we provide bounds on the quantity

tmix(δ; β) = sup µ0 Pβ(Π) tmix(δ; µ0)

where Pβ(Π) denotes the set of all distributions that are β-warm with respect to Π. Naturally, as the value of β decreases, the task of generating samples from the target distribution becomes easier.4 However, access to a good warm distribution (small β) may not be feasible for many applications, and thus deriving bounds on mixing time of the Markov chain from non-warm starts is also desirable. Consequently, in the sequel, we also provide practical initialization methods and polynomial-time mixing time guarantees from such starts. Our mixing time bounds involve the functions r and w given by

r(s) = 2 + 2 max 1 d0.25 log0.25 1

, 1 d0.5 log0.5 1

w (s) = min m r(s) L

We use TMALA(h) to denote the transition operator on probability distributions induced by one step of MALA. With this notation, we have the following mixing time bound for the MALA algorithm for a strongly-log concave measure from a warm start.

Theorem 1 For any β-warm initial distribution µ0 and any error tolerance δ (0, 1], the Metropolis adjusted Langevin algorithm with step size h = c w(δ/(2β)) satisﬁes the bound T k MALA(h)(µ0) Π TV δ for all iteration numbers

dκ, d0.5κ1.5r δ

where c, c denote universal constants.

4. For instance, β = 1 implies that the chain starts at the stationary distribution and has already mixed.

Metropolis-Hastings algorithms are fast

See Section 5.2 for the proof.

Note that r(s) 4 for s e d and thus we can treat r(δ/2β) as small constant for most interesting values of δ if the warmness parameter β is not too large. Consequently, we can run MALA with a ﬁxed step size h for a large range of error-tolerance δ. Treating the function r as a constant, we obtain that if κ = o(d), the mixing time of MALA scales as O (dκ log(1/δ)). Note that the dependence on the tolerance δ is exponentially better than the O dκ2 log2(1/δ)/δ2 mixing time of ULA, and has better dependence on κ while still maintaining linear dependence on d. In fact, for any setting of κ, d and δ, MALA always has a better mixing time bound compared to ULA. A limitation of our analysis is that the constant c is not small. However, we demonstrate in Section 4 that in practice small constants provide performance that matches the scalings suggested by our theoretical bounds. Let TMRW(h) denote the transition operator on the space of probability distributions induced by one step of MRW. We now state the convergence rate for Metropolized random walk for strongly-log concave density.

Theorem 2 For any β-warm initial distribution µ0 and any δ (0, 1], the Metropolized random walk with step size h = cm d L2r(δ/2β) satisﬁes

T k MRW(h)(µ0) Π TV δ for all k c dκ2r δ

where c, c denote universal constants.

See Section 5.6 for the proof.

Again treating r(δ/2β) as a small constant, we ﬁnd that the mixing time of MRW scales as O dκ2 log(1/δ) which has an exponential factor in δ better than ULA. Compared to the mixing time bound for MALA, the bound in Theorem 2 has an extra factor of O(κ). While such a factor is conceivable given that MALA s proposal distribution uses ﬁrst-order information about the target distribution and MRW uses only the function values, it would be interesting to determine if this gap can be improved. See Section 6 for a discussion on possible future work in this direction.

3.2. Mixing time bounds for a feasible start

In many cases, a good warm start may not be available. Consequently, mixing time bounds from a feasible starting distribution can be useful in practice. Letting x denote the unique mode of the target distribution Π, we claim that the distribution µ = N(x , L 1Id) is one such choice. Recalling the condition number κ = L/m, we claim that the warmness parameter for µ can be bounded as follows:

Π(A) κd/2 = β , (12)

where the supremum is taken over all measurable sets A. When the gradient f is available, ﬁnding x comes at nominal additional cost: in particular, standard optimization algorithms

Dwivedi, Chen, Wainwright and Yu

such as gradient descent be used to compute a δ-approximation of x in O (κ log(1/δ)) steps (e.g., see the monograph by Bubeck 2015). Also refer to Section 3.2.1 for more details when we have inexact parameters. Assuming claim (12) for the moment, we now provide mixing time bounds for MALA and MRW with µ as the starting distribution. For any threshold δ (0, 1], we deﬁne the step sizes h1 = c w(δ/2β ) and h2 = c m d L2 r(δ/2β ), where the function w was previously deﬁned in equation (9b).

Corollary 3 With µ as the starting distribution, we have

T k MRW(h2)(µ ) Π TV δ for all k c d2κ2 log1.5 κ

, and (13a)

T k MALA(h1)(µ ) Π TV δ for all k c d2κ log κ

The proof follows by plugging the bound (12) in Theorems 1 and 2 and is thereby omitted. We now prove the claim (12). Without loss of generality, we can assume that f(x ) = 0. Such an assumption is possible because substituting f( ) by f( ) + α for any scalar α leaves the distribution Π unchanged. Since f is m-strongly convex and L-smooth, applying Lemma 8(c) and Lemma 9(c), we obtain that

2 x x 2 2 f(x) m

2 x x 2 2 , x Rd.

Consequently, we ﬁnd that R

Rd e f(x)dx (2π/m)d/2. Making note of the lower bound

(2πm 1)d/2 , (14)

and plugging in the expression for the density of µ yields the claim (12). We now derive results for the case when we do not have access to exact parameters, e.g., if the mode x is known approximately, and/or we only have an upper bound for the smoothness parameter L a situation quite prevalent in practice.

3.2.1. Starting distribution with inexact parameters

Note that x is also the unique global minimum of the negative log-density f. For the strongly convex function f, using a ﬁrst-order method, like gradient descent, we can obtain an ϵ-approximate mode x using κ log(1/ϵ) evaluations of the gradient f. Suppose we have access to a point x such that x x 2 ϵ and have an upper bound estimate L L for the smoothness. We now consider the case of starting distribution µ = N( x, (2 L) 1Id), as a proxy for the feasible start µ = N(x , L 1Id) discussed above. Note the diﬀerence in mean and the covariance between the distributions µ and µ . Given the handy result in Theorem 1, it suﬃces to bound the warmness parameter for the distribution µ. Applying the triangle inequality, we ﬁnd that

2 x x 2 2 x x 2 2 (15)

Metropolis-Hastings algorithms are fast

and consequently that

µ(x) = (π L 1) d/2 exp L x x 2 2

(π L 1) d/2 exp

L x x 2 2 2 + L x x 2 2

Using the lower bound (14) on the target density, we ﬁnd that

L x x 2 2 ( L L) x x 2 2 2

2 log(2κ L/L) + Lϵ2 ,

where the last inequality follows from the fact that L L. In other words, the distribution µ is β-warm with respect to the target distribution π, where β = exp d 2 log(2κ L/L) + Lϵ2 . Using Theorem 1, we now derive a mixing time bound for MALA with starting distribution µ. For any threshold δ (0, 1], we use the step size h3 = c w(δ/(2 β)). Invoking Theorem 1 and plugging in the deﬁnition (9a) of w, we ﬁnd that T k MALA(h3)( µ) Π TV δ, for all

which also recovers the bound from corollary 3 for MALA as ϵ 0 and L L. Note that the mixing time increases (additively) by O κdϵ2 L/L when we only have an ϵ-

approximate mode, which is an ( L/L ϵ/d)-fraction increase in the mixing time bound with starting distribution µ . A mixing time bound for MRW with starting distribution µ can be obtained in a similar fashion and is thereby omitted.

3.3. Weakly log-concave densities

In this section, we show that MALA can also be used for approximate sampling from a density which is L-smooth but not necessarily strongly log-concave (also referred to as weakly log-concave, see, e.g., Dalalyan 2016). In simple words, the negative log-density f satisﬁes the condition (7a) with parameter L and satisﬁes the condition (7b) with parameter m = 0, (equivalently we have LId 2f(x) 0; see Appendix A for further details.) Note that we still assume that R

Rd e f(x)dx < so that the distribution Π is well-deﬁned. In order to make use of our previous machinery for such a case, we approximate the given log-concave density Π with a strongly log-concave density Π such that Π Π TV is small. Next, we use MALA to sample from Π and consequently obtain an approximate sample from Π. In order to construct Π, we use a scheme previously suggested by Dalalyan (2016). With λ as a tuning parameter, consider the distribution Π given by the density

Rd e f(y)dy e f(x) where f(x) = f(x) + λ

2 x x 2 2 . (17)

Dwivedi, Chen, Wainwright and Yu

Dalalyan (2016) (Lemma 3) showed that that the total variation distance between Π and Π is bounded as follows:

f f L2(π) λ

Rd x x 4 2 π(x)dx 1/2 .

Suppose that the original distribution Π has its fourth moment bounded as Z

Rd x x 4 2 π(x)dx d2ν2. (18)

We now set λ := 2δ/(dν) to obtain Π Π TV δ/2. Since f is λ/2-strongly convex and L + λ/2-smooth, the condition number of Π is given by κ = 1 + Ldν/δ. We substitute κ = Ldν/δ to obtain simpliﬁed expressions for mixing time bounds in the results that follow. Since now the target distribution is Π, we suitably modify the step size for MALA as follows:

Ld min s r(s)

where the function r was previously deﬁned in equation (9a). We refer to this new set-up with a modiﬁed target distribution Π as the modiﬁed MALA method. To keep our results simple to state, we assume that we have a warm start with respect to Π.

Corollary 4 Assume that Π satisﬁes the condition (18). Then for any given error-tolerance δ (0, 1), and, any β-warm start µ0, the modiﬁed MALA method with step size h = cwlc(δ/(2β)) satisﬁes T k MALA(h)(µ0) Π TV δ for all

where c, c denote universal positive constants.

The proof follows by combining the triangle inequality, as applied to the TV norm, along with the bound from Theorem 1. Thus, for weakly log-concave densities, modiﬁed MALA mixes in O d2/δ1.5 , which improves upon the O d3/δ4 mixing time bound for a ULA scheme on Π, as established by Dalalyan (2016). A mixing time bound of O d3/δ2 for MRW can be derived similarly for this case, simply by noting that the new condition number κ = Ldν/δ for the modiﬁed density and the fact that the mixing time of MRW is O d κ2

in the strongly log-concave setting.

4. Numerical experiments

In this section, we compare MALA with ULA and MRW in various simulation settings. The step-size choice of ULA follows from the paper by Dalalyan (2016) in the case of a warm start. The step-size choice of MALA and MRW used in our experiments in our results are summarized in Table 3.

Metropolis-Hastings algorithms are fast

Summary of experiment set-ups and diagnostic tools: We consider four diﬀerent experiments: (i) sampling a multivariate Gaussian (Section 4.1), (ii) sampling a Gaussian mixture (Section 4.2), (iii) estimating the MAP with credible intervals in a Bayesian logistic regression set-up (Section 4.3) and (iv) studying the eﬀect of step-size on the accept reject step (Section 4.4). Since TV distance for continuous measures is hard to estimate, we use several proxy measures for convergence diagnostics: (a) errors in quantiles, (b) ℓ1distance in histograms (which we refer to as discrete tv-error), (c) error in sample MAP estimate, (d) trace-plot along diﬀerent coordinates and (e) autocorrelation plot. While the ﬁrst three measures (a-c) are useful for diagnosing the convergence of random walks over several independent runs, the last two measures (d-e) are useful for diagnosing the rate of convergence of the Markov chain in a single long run.

4.1. Dimension dependence for multivariate Gaussian

The goal of this simulation is to demonstrate the dimension dependence in experiments, for mixing time of ULA, MALA, and MRW when the target is a non-isotropic multivariate Gaussian. Note that Theorems 1 and 2 imply that the dimension dependency for both MALA and MRW is d. We consider sampling from multivariate Gaussian with density π deﬁned by

x 7 π(x) e 1

2 x Σ 1x, (19)

where Σ Rd d the covariance matrix to be speciﬁed. For this target distribution, the function f, its derivatives are given by

2x Σ 1x, f(x) = Σ 1x, and 2f(x) = Σ 1.

Consequently, the function f is strongly convex with parameter m = 1/λmax(Σ) and smooth with parameter L = 1/λmin(Σ). For convergence diagnostics, we use the error in quantiles along diﬀerent directions. Using the exact quantile information for each direction for Gaussians, we measure the error in the 75% quantile of the sample distribution and the true distribution in the least favorable direction, i.e., along the eigenvector of Σ corresponding to the eigenvalue λmax(Σ). The approximate mixing time ˆkmix(δ) is deﬁned as the smallest iteration when this error falls below δ. We use µ as the initial distribution where µ = N 0, L 1Id .

4.1.1. Strongly log-concave density

The step-sizes are chosen according to Table 3. For ULA, the error-tolerance δ is chosen to be 0.2. We set Σ as a diagonal matrix with the largest eigenvalue 4.0 and the smallest eigenvalue 1.0 so that the κ = 4 is ﬁxed across diﬀerent settings. For a ﬁxed dimension d, we simulate 10 independent runs of the three chains each with N = 10, 000 samples to determine the approximate mixing time. The ﬁnal approximate mixing time for each walk is the average of that over these 10 independent runs. Figure 1(a) shows the dependency of the approximate mixing time as a function of dimension d for the three random walks in log-log scale. To examine the dimension dependency, we perform linear regression for approximate mixing time with respect to dimensions in the log-log scale. The computations

Dwivedi, Chen, Wainwright and Yu

MALA ULA MRW

100 101 1/δ

MALA ULA MRW

Figure 1. Scaling of the approximate mixing time ˆkmix (refer to the discussion after equation (19) for the deﬁnition) on multivariate Gaussian density (19) where the covariance has condition number κ = 4. (a) Dimension dependency. (b) Error-tolerance dependency.

reveal that the dimension dependency of MALA, ULA and MRW are all close to order d (slope 0.84, 1.01 and 0.97). Figure 1(b) shows the dependency of the approximate mixing time on the inverse error 1/δ for the three random walks in log-log scale. For ULA, the step-size is error-dependent, precisely chosen to be 10 times of δ. A linear regression of the approximate mixing time on the inverse error 1/δ yields a slope of 2.23 suggesting the error dependency of order 1/δ2 for ULA. A similar computation for MALA and MRW yields a slope of 0.33 for both the cases which not only suggests a signiﬁcantly better error dependency for these two chains but also partly veriﬁes their theoretical mixing time bounds of order log(1/δ).

Random walk ULA MALA MRW

Step size δ2

dκL 1 L min 1

Table 3. Step size used in simulations to obtain δ-accuracy for diﬀerent random walks in Rd with target π e f such that m Id 2f(x) LId for any x Rd and κ := L/m.

4.1.2. Weakly log-concave density

We now discuss the convergence of the random walks when the Gaussian is ﬂat along a direction. In particular, we consider the Gaussian distribution such that λmax(Σ) = 1000 and λmin(Σ) = 1. Such a setting implies that the strong convexity parameter m = 0.001 0 and hence our target density mimics a weakly log-concave density. For convergence diagnostics, we use the error in quantiles along one direction other than the ones which correspond to λmax(Σ) and λmin(Σ). Using the exact quantile information for each direction for Gaussians, we measure the error between the 75% quantile of the sample distribution and the true distribution in that direction. The approximate mixing time is deﬁned as the smallest iteration when this error falls below δ. We use µ as the initial distribution

Metropolis-Hastings algorithms are fast

MALA ULA MRW

101 2 100 3 1004 100 6 100

MALA ULA MRW

Figure 2. Scaling of the approximate mixing time ˆkmix (refer to the discussion after equation (19) for the deﬁnition) for a close to weakly log-concave Gaussian density. (a) Dimension dependency. (b) Error-tolerance dependency for ﬁxed dimension .

where µ = N 0, L 1Id . The step-sizes are chosen according to Table 3 where m is chosen to be δ/(d L). For dimension dependence experiments, we ﬁx the error-tolerance δ as 0.2. For a ﬁxed dimension d, we simulate 10 independent runs of the three chains each with N = 10, 000 samples to determine the approximate mixing time. The ﬁnal approximate mixing time for each walk is the average of that over these 10 independent runs. Figure 2(a) and 2(b) show the dependency of the approximate mixing time as a function of dimension d and the inverse error 1/δ respectively, for the three random walks on this weakly logconcave density (log-log scale). Linear ﬁts on the log-log scale reveal that the dimension dependence of mixing time for MALA is close to d2 (slope 1.61), and that for ULA is close to d3 (slope 2.78) and for MRW it is approximately of order d3 (slope 2.73). Linear ﬁts of the approximate mixing time on the inverse error 1/δ yield a slope of 3.92 for ULA thereby suggesting an error dependence of order 1/δ4, while for MALA and MRW this dependence is of order 1/δ1.5 (slope 1.56) and of order 1/δ2 (slope 2.01), respectively. These scalings partly verify the rates derived in Corollary 4 and demonstrate the gains of MALA over ULA for the weakly log-concave densities.

4.1.3. Warmness in simulations

Strictly speaking, for both the cases considered above, the starting distribution was not warm since we used µ as the starting distribution and the corresponding warmness parameter β = O(ed) scales exponentially with dimension d. However, the mixing time observed in the simulations, albeit with a heuristic measure, are d times faster than those stated with µ as the starting distribution in Corollary 3, and are in fact consistent with the results for the warm-start which are stated in Theorems 1 and 2. We believe that the results stated in Corollary 3, with µ as the starting distribution can be improved by a factor of d. See Section 6 for a discussion on recent work on this question.

Dwivedi, Chen, Wainwright and Yu

4.2. Behavior for Gaussian mixture distribution

We now consider the task of sampling from a two component Gaussian mixture distribution, as previously considered by Dalalyan (2016) for illustrating the behavior of ULA. Here compare the behavior of MALA to ULA for this case. The target density is given by

x 7 π(x) = 1

e x a 2 2/2 + e x+a 2 2/2 ,

where a Rd is a ﬁxed vector. This density corresponds to the two-mixture of equal weighted Gaussians N(a, Id) and N( a, Id). In our notation, the function f and its derivatives are given by: f(x) = 1

2 x a 2 2 log(1 + e 2x a),

f(x) = x a + 2a(1 + e2x a) 1, and , 2f(x) = Id 4aa e2x a 1 + e2x a 2 .

From examination of the Hessian, we see that the function f is smooth with parameter L = 1, and whenever a 2 < 1, it is also strongly convex with parameter m = 1 a 2 2. For dimension d = 2, setting a = 1

2 yields the parameters m = 1

2 and L = 1. Figure 3 shows the level sets of the density of this 2D-Gaussian mixture. The initial distribution is chosen as µ = N 0, L 1Id and the step-sizes are chosen according to Table 1, where for ULA, we set three diﬀerent choices of δ = 0.2 (ULA), δ = 0.1 (small-step ULA) and δ = 1.0 (large-step ULA). Note that choosing a smaller threshold δ implies that the ULA has a smaller step size and consequently the chain takes larger to converge. However, the asymptotic TV error with respect to the target distribution Π for ULA also decreases with a decrease in step size. These diﬀerent choices of step sizes are made to demonstrate the fundamental trade-oﬀbetween the rate of convergence and asymptotic error for ULA and its inability to mix faster than MALA for diﬀerent settings. Note that one can sample directly from the mixture of Gaussian under consideration by drawing independently a Bernoulli(1/2) random variable y and a standard normal variable z N(0, Id), and then outputting the random variable

x = y (z a) + (1 y) (z + a)

This observation makes it easy to diagnose the convergence of our Markov chains with target π. In order to estimate the total variation distance, we discretize the distribution of N = 250, 000 samples from π over a set of bins, and consider the total variation of this discrete distribution from the empirical distribution of the Markov chain over these bins. We refer to this measure as the discretized TV error. We measure the sum of two discrete TV errors of 250, 000 samples from π with the empirical distribution obtained by simulating the chains ULA, MALA or MRW, projected on two principal directions (u1 and u2), over a discrete grid of size B = 100. Figure 4 shows the sum of the discretized TV errors along u1 and u2, as a function of iterations. The true total variation distance between the distribution of the iterate and the target distribution is upper bounded by the sum of (a) the discretized TV error and (b) the error caused by discretization. In order to obtain a sense of the magnitude of the error of type (b), we simulate 100 runs of the discrete TV error between two independent drawings from the true distribution π. The two black lines

Metropolis-Hastings algorithms are fast

Figure 3. Level set of the density of the 2D Gaussian mixture. The red dots are the location of the means a and a, where a is chosen such that a 2 2 = 1

2. The arrows indicate the two principal directions u1 and u2 along which the TV error is measured.

in Figure 4 are the maximum and minimum of these 100 values. The sample distribution at convergence is expected to lie between the two black lines. Figure 4(a) shows that ULA converges signiﬁcantly slower than MALA to the right distribution. Figure 4(b) illustrates this point further and shows that when compared to the ULA, the small-step ULA (δ = 0.1) converges at a much slower rate and large-step ULA (δ = 1.0) has a larger approximation error (asymptotic bias).

0 100 200 300 400 500 600 Iteration

Discrete TV error

ULA MALA MRW

0 100 200 300 400 500 600 Iteration

Discrete TV error

ULA ULA large ULA small

Figure 4. Discrete TV error on a two component Gaussian mixture. (a) Behavior of three diﬀerent random walks. (b) Behavior of ULA with diﬀerent choices of step sizes.

We accompany the study based on exact TV error computation with two classical convergence diagnostic plots for general MCMC algorithms. Figure 5 shows the trace plots of the three sampling algorithms in 10 runs. Comparing the three panels (a) (c) in Figure 5, we observe that the trace plot of MALA stabilizes much faster than that of ULA and MRW.

Dwivedi, Chen, Wainwright and Yu

Furthermore, to compare the eﬃciency of the chains in stationarity, Figure 6 shows the autocorrelation function of the three chains. In order to ensure that these autocorrelations are computed for the stationary distribution, we set in practice the burn-in period to be 300 iterations. Again, we observe that MALA is more eﬃcient than ULA and MRW.

0 100 200 300 400 500 600 Iteration

Sample value on the 1st coordinate

0 100 200 300 400 500 600 Iteration

Sample value on the 1st coordinate

0 100 200 300 400 500 600 Iteration

Sample value on the 1st coordinate

(a) (b) (c)

Figure 5. Trace-plot of the ﬁrst coordinate on a two component Gaussian mixture. (a) Trace-plot of ULA. (b) Trace-plot of MALA. (c) Trace-plot of MRW.

0 100 200 300 400 500 600 Lag

Autocorrelation

ULA MALA MRW

Figure 6. Markov chain autocorrelation function plot. The burn-in time for the plot is set to 300 iterations.

4.3. Bayesian Logistic Regression

We now consider the problem of logistic regression in a frequentist-Bayesian setting, similar to that considered by Dalalyan (2016). Once again, we establish that MALA has superior performance relative to ULA. Given a binary variable y {0, 1} and a covariate x Rd, the logistic model for the conditional distribution of y given x takes the form

P(y = 1|x; θ) = eθ x

1 + eθ x , (20)

for some parameter θ Rd. In a Bayesian framework, we model the parameter θ in the logistic equation as a random variable with a prior distribution π0. Suppose that we observe a set of independent

Metropolis-Hastings algorithms are fast

samples {(xi, yi)}n i=1 with (xi, yi) Rd {0, 1}, with each yi conditioned on xi drawn from a logistic distribution with some unknown parameter θ . Using Bayes rule, we can then compute the posterior distribution of the parameter θ given the data. Drawing samples from this posterior distribution allows us to estimate and draw inferences about the unknown parameter. Under mild conditions, the Bernstein-von-Mises theorem guarantees that the posterior distribution will concentrate around the true parameter θ , in which case we expect that the credible intervals formed by sampling from the posterior should contain θ

with high probability. This fact provides a lens for us to assess the accuracy of our sampling procedure. Deﬁne the vector Y = (y1, . . . , yn) {0, 1}n and let X be the n d matrix with xi as ith-row. We choose the prior π0 to be a Gaussian distribution with zero mean and covariance matrix proportional to the inverse of the sample covariance matrix ΣX = 1

n X X. Plugging in the formulas for the prior and likelihood, we ﬁnd that the the posterior density is given by

π(θ) = π(θ|X, Y ) exp

i=1 log 1 + eθ xi α Σ1/2 X θ 2

where α > 0 is a user-speciﬁed parameter. Writing π e f, we observe that the function f and its derivatives are given by

f(θ) = Y Xθ +

i=1 log 1 + eθ xi + α Σ1/2 X θ 2

f(θ) = X Y +

xi 1 + e θ xi + αΣXθ, and,

e θ xi 1 + e θ xi 2 xix i + αΣX.

With some algebra, we can deduce that the eigenvalues of the Hessian 2f are bounded between L := (0.25n + α) λmax(ΣX) and m := α λmin(ΣX) where λmax(ΣX) and λmin(ΣX) denote the largest and smallest eigenvalues of the matrix ΣX. We make use of these bounds in our experiments. As in Dalalyan (2016), we also consider a preconditioned version of the method; more precisely, we ﬁrst sample from πg e g where g(θ) = f(Σ 1/2 X θ), and then transform the obtained random samples θi 7 Σ1/2 X θi to obtain samples from π. Sampling based on the preconditioned distribution improves the condition number of the problem. After the preconditioning, we have the bounds Lg 0.25n+α and mg α, so that the new condition number is now independent of the eigenvalues of ΣX. We randomly draw i.i.d. samples (xi, yi) as follows. Each vector xi Rd is sampled i.i.d. Rademacher components, and then renormalized to Euclidean norm. given xi, the response yi is drawn from the logistic model (20) with θ = θ = 1d = (1, . . . , 1) . We ﬁx d = 2, n = 50 and perform N = 1000 experiments. In order to sample from the posterior, we start with the initial distribution as µ0 = N(0, L 1Id). As the ﬁrst error metric, we

Dwivedi, Chen, Wainwright and Yu

0 1000 2000 3000 4000 Iteration

ULA large ULA MALA MRW

0 1000 2000 3000 4000 Iteration

ULA large ULA MALA MRW

Figure 7. Mean error as a function of iteration number. (a) Without preconditioning. (b) With preconditioning.

measure the ℓ1-distance between the true parameter θ and the sample mean ˆθk of the random samples obtained from simulating the Markov chains for k iterations:

Figure 7 provides a log-scale plot of this error versus the iteration number. Since there is always an approximation error caused by the prior distribution, ULA with large stepsize (δ = 1.0) can be used. However, our simulation shows that it is still slower than MALA. Also, the condition number κ has a signiﬁcant eﬀect on the mixing time of ULA and MRW. Their convergence in the preconditioned case is signiﬁcantly better. Furthermore, the autocorrelation plots in Figure 8 and the plots in Figure 9 of the sample (across experiments) mean and 25% and 75% quantiles, with θ subtracted, as a function of iterations suggest a similar story: MALA converges faster than ULA and is less aﬀected by the conditioning of the problem.

4.4. Step size vs accept-reject rate

In this section, we provide a few simulations that highlight the eﬀect of step size for MALA and MRW. Note that our bounds from Theorem 1 and 2 suggest a step size choice of order d 1 for both MALA and MRW, which in turn led to the mixing time bounds of O (d). These choices of step sizes arise when we try to provide a worst-case control on the accept-reject step of these algorithms. In particular, these choices ensure that the Markov chains do not get stuck at a given state x, or equivalently, that the proposals at any given state are accepted with constant probability. If instead, one chooses a very large step size, the (worst-case) probability of acceptance may decay exponentially with dimensions. Nonetheless, these worst-case bounds may not hold, which would imply a faster mixing time for these chains if a larger step size were to be used. To check the validity of larger step sizes, we repeated a few experiments discussed above, albeit with a larger step size. In particular, we simulated the random walks for a wide-range of step sizes d γ for γ {0.2, 0.33, 0.5, 0.67} for MALA, and, γ {0.4, 0.67, 1, 1.33} for

Metropolis-Hastings algorithms are fast

0 1000 2000 3000 4000 5000 6000 Lag

Autocorrelation

ULA MALA MRW

0 1000 2000 3000 4000 5000 6000 Lag

Autocorrelation

ULA MALA MRW

Figure 8. Autocorrelation function plot of the ﬁrst coordinate of the estimate as a function lag. The burn-in time for the plot is set to 300 iterations. (a) Without preconditioning. (b) With preconditioning.

0 1000 2000 3000 4000 Iteration

ULA large MALA

0 1000 2000 3000 4000 Iteration

ULA large MALA

Figure 9. Mean and 25% and 75% quantiles, with θ subtracted, as a function of iteration number. (a) Without preconditioning. (b) With preconditioning.

MRW. We ran these chains for two diﬀerent cases: (a) Sampling from non-isotropic Gaussian density, discussed in Section 4.1, and, (b) Posterior sampling in Bayesian logistic regression, discussed in Section 4.3). In Figure 10, we plot the average acceptance probability for diﬀerent step sizes d γ as the dimension d increases. These probabilities were computed as the average number of proposals accepted over 100 iterations after a manually tuned burn-in period, and further averaged across 50 independent runs. We now remark on the observations from Figure 10. We see that for MALA the acceptance probability for the step size choice of d 0.2 vanishes as d increases. Indeed, the choice of d 0.5 appears to be a safe choice for both cases. In contrast, for MRW, we need a smaller step size. From panels (c) and (d), we see that d 1 appears to be the correct choice to ensure that the proposals are accepted with a constant probability when the dimension d is large. Informally, if a step size choice of d γ was to guarantee a non-vanishing acceptance probability for MALA or MRW, our proof techniques imply a mixing time bound of O (dγ).

Dwivedi, Chen, Wainwright and Yu

0 1000 2000 3000 4000 5000 Dimension d

Accept Rate

γ = 0.20 γ = 0.33 γ = 0.50 γ = 0.67

0 100 200 300 400 500 600 Dimension d

Accept Rate

γ = 0.20 γ = 0.33 γ = 0.50 γ = 0.67

(a) MALA: Non-isotropic Gaussian (b) MALA: Bayesian logistic regression

0 1000 2000 3000 4000 5000 Dimension d

Accept Rate

γ = 0.40 γ = 0.67 γ = 1.00 γ = 1.33

0 100 200 300 400 500 600 Dimension d

Accept Rate

γ = 0.40 γ = 0.67 γ = 1.00 γ = 1.33

(c) MRW: Non-isotropic Gaussian (d) MRW: Bayesian logistic regression

Figure 10. Eﬀect of large step size for accept-reject ratio for MALA and MRW. From panels (a) and (b), we see that for MALA the step size choice of d 0.5 has a non-vanishing acceptance probability rate for both cases. On the other hand, panels (c) and (d) show that for MRW d 1 is a good choice for the step size.

Combining this argument with the observations above, we suspect that the bounds for MALA from Theorem 1 may not be tight, while for MRW the bounds from Theorem 2 are very likely to be tight. Deriving a faster mixing time for MALA or establishing that the current dimension dependency for MRW is tight, are interesting research directions and we leave the further investigation of these questions for future work.

We now turn to the proofs of our main results. In Section 5.1, we begin by introducing some background on conductance bounds, before stating three auxiliary lemmas that underlie the proofs of our main theorems. Taking these three lemmas as given, we then provide the proof of Theorem 1 in Section 5.2. Sections 5.3 through 5.5 are devoted to the proofs of our three key lemmas, and we conclude with the proof of Theorem 2 in Section 5.6.

Metropolis-Hastings algorithms are fast

5.1. Conductance bounds and auxiliary results

Our proofs exploit standard conductance-based arguments for controlling mixing times. Consider an ergodic Markov chain deﬁned by a transition operator T , and let Π be its stationary distribution. For each scalar s (0, 1/2), we deﬁne the s-conductance

Φs := inf Π(A) (s,1 s)

A Tu(Ac)π(u)du min {Π(A) s, Π(Ac) s}. (21)

In this formula, the notation Tu is shorthand for the distribution T (δu) obtained by applying the transition operator to a dirac distribution concentrated on u. In words, the s-conductance measures how much probability mass ﬂows across disjoint sets relative to their stationary mass. By a continuity argument, it can be seen that limiting conductance of the chain is equal to the limiting value of s-conductance that is, Φ = lims 0 Φs. For a reversible lazy Markov chain with β-warm start, Lov asz 1999 (see also Kannan et al. (1995)) proved that

T k(µ0) Π TV βs + β 1 Φ2 s 2

k βs + βe kΦ2 s/2 for any s 0, 1

In order to make eﬀective use of this lower bound, we need to lower bound the s-conductance Φs, and then choose the parameter s so as to optimize the tradeoﬀbetween the two terms in the bound. We now state some auxiliary results that are useful. We start with a result that shows that the probability mass of any strongly log concave distributions is concentrated in a Euclidean ball around the mode. For each s (0, 1), we introduce the Euclidean ball

where the function r was previously deﬁned in equation (9a), and x := arg max x Rd π(x)

denotes the mode.

Lemma 5 For any s 0, 1

2 , we have Π(Rs) 1 s.

See Section 5.3 for the proof of this claim.

In order to establish the conductance bounds inside this ball, we ﬁrst prove an extension of a result by Lov asz (1999). The next result provides a lower bound on the ﬂow of Markov chain with transition distribution Tx and strongly log concave target distributions Π. Similar results have been used in several prior works to establish fast mixing of several random walks like ball walk, Hit and run (Lov asz, 1999; Lov asz and Vempala, 2006, 2007), Dikin walk (Narayanan, 2016) and Vaidya and John walks (Chen et al., 2018).

Lemma 6 Let K be a convex set such that Tx Ty TV 1 ρ whenever x, y K and x y 2 . Then for any measurable partition A1 and A2 of Rd, we have Z

A1 Tu(A2)π(u)du ρ

4 min 1, log 2 Π2(K) m

min {Π(A1 K), Π(A2 K)} .

Dwivedi, Chen, Wainwright and Yu

See Section 5.4 for the proof of this lemma. We next introduce a few pieces of notations to state a MALA speciﬁc result. Deﬁne a function w : (0, 1) (0, 1) R+ as follows:

w(s, ϵ) := min ϵ 8

d L , ϵ 64αϵ

26(αϵr2(s))1/3 1 L

1/3 , (25a)

where αϵ := 1 + 2 p

log(16/ϵ) + 2 log(16/ϵ), (25b)

and the function r was deﬁned in equation (9a). In the next lemma, we show two important properties for MALA: (1) the proposal distributions of MALA at two diﬀerent points are close if the two points are close, and (2) the accept-reject step of MALA is well behaved inside the ball Rs provided the step size is chosen carefully. Note that for MALA, the proposal distribution of the chain at x is given by

PMALA(h) x = N(µx, 2h Id), where µx = x h f(x). (26)

We use T MALA(h) x to denote the transition distribution of MALA.

Lemma 7 For any step size h 0, 2

L , the MALA proposal distribution satisﬁes the bound

sup x,y Rd x =y

P MALA(h) x P MALA(h) y TV x y 2 1

Moreover, given scalars s (0, 1/2) and ϵ (0, 1), then the MALA proposal distribution for any step size h 0, w(s, ϵ) satisﬁes the bound

sup x Rs PMALA(h) x T MALA(h) x TV ϵ

where the truncated ball Rs was deﬁned in equation (23).

See Section 5.5 for the proof of this claim. With these results in hand, we are now equipped to prove the mixing time bound for MALA.

5.2. Proof of Theorem 1

At a high level, the proof involves three key steps. Our ﬁrst step is to use Lemma 7 to establish that for an appropriate choice of step size, the MALA update has nice properties inside a high probability region given by Lemma 5. The second step is to apply Lemma 6 so as to obtain a lower bound on the s-conductance Φs of the MALA update. Finally, by making an appropriate choice of parameter s, we establish the claimed convergence rate. In order to simplify notation, we drop the superscripts MALA(h) from our notation that is, we use Tx and Px, respectively, to denote the transition and proposal distributions at x for MALA, each with step size h. By applying the triangle inequality, we obtain the upper bound

Tx Ty TV Px Tx TV + Px Py TV + Py Ty TV. (28)

Metropolis-Hastings algorithms are fast

Now applying claim (27a) from Lemma 7 guarantees that

Px Py TV ϵ/

2 for all x, y Rd such that x y 2 ϵ

Furthermore, for any h w(s, ϵ), the bound (27b) from Lemma 7 implies that Px Tx TV ϵ/8 for any x Rs. Plugging in these bounds in the inequality (28), we ﬁnd that

Tx Ty TV 1 (1 ϵ) x, y Rs such that x y 2 ϵ

Thus, the transition distribution Tx satisﬁes the assumptions of Lemma 6 for

K = Rs, ρ = (1 ϵ) and = ϵ

We now derive a lower bound on the s-conductance of MALA. Choosing a measurable set A such that Π(A) > s and substituting the terms from equation (29) in the inequality (24), we ﬁnd that Z

A Tu(Ac)π(u)du (1 ϵ)

min {Π(A Rs), Π(Ac Rs)}

64 min {Π(A) s, Π(Ac) s} .

In this argument, inequality (i) follows from the facts that log 2 1/2 and Π(A), Π(Ac) > s. Moreover, we have applied Lemma 5 to ﬁnd that Π(Rs) 1 s and hence

Π(X Rs) = Π(X) Π(X Rc s) Π(X) s for X {A, Ac}.

We have also assumed that the second argument of the minimum is less than 1. Applying the deﬁnition (21) of Φs for MALA, we ﬁnd that

ΦMALA(h) s (1 ϵ)ϵ Π2(Rs)

hm 64 , for any h w(s, ϵ). (30)

By making a suitable choice of s, we can now complete the proof. Using Lemma 5, we have that Π(Rδ/2) 1 δ/2 1/2 for any δ (0, 1). Applying the deﬁnition (25b) of αϵ, we obtain that α1/2 12. Using this fact and the deﬁnitions (9b) and (25a) for the functions w( ) and w( , ), it is straightforward to verify that cw(δ/(2β)) w(δ/(2β), 1/2), for an appropriate choice of universal constant c. Substituting in s = δ/(2β), ϵ = 1/2, and h = cw(δ/(2β)), and also making use of the lower bound Π(Rδ/2β) 1/2 in the

bound (30), we ﬁnd that Φ MALA(h) δ/2β c

mh for some universal constant c . Using the convergence rate (22), we obtain that

T k MALA(h)(µ0) Π TV β δ

2β + βe kmh/c δ for all k c

for a suitably large constant c . Substituting the expression (9b) for h = cw(δ/(2β)), yields the claimed bound on mixing time.

Dwivedi, Chen, Wainwright and Yu

5.3. Proof of Lemma 5

The proof consists of two main steps. First, we establish that the distribution Π is sub Gaussian, which then guarantees concentration around the mean. Second, we show that the mean and the mode of the distribution Π are not far apart. Combining these two claims yields a high probability region around the mode x . Let x denote the random variable with distribution Π and mean x = Ex Π [x]. We claim that x x is a sub-Gaussian random vector with parameter 1/ m, meaning that

Ex h eu (x x)i e u 2 2/(2m) for any vector u Rd.

In order to prove this claim, we make use of a result due to Harg e (2004) (Theorem 1.1), which we restate here. Let y N(µ, Σ) with density e and x be a random variable with density function q e where q is a log-concave function. Then for any convex function g : Rd 7 R we have

Ex [g(x E[x])] Ey [g(y E[y])] . (32)

From Lemma 8(b) we have that x 7 f(x) m

2 x x 2 2 is a convex function. Thus we can express the density π as the product of a log concave function and the density of a random variable with distribution N(x , Id/m). Letting y N(x , Id/m) and noting that g(z) := eu z is a convex function for each ﬁxed vector u, applying the Harg e bound (32) yields

Ex h eu (x x)i Ey h eu (y x )i (i) e u 2 2/2m.

Here inequality (i) follows from the fact that the random vector y x is sub-Gaussian with parameter 1/ m. Using the standard tail bounds for quadratic forms for sub-Gaussian random vectors (e.g., Theorem 1 by Hsu et al. 2012), we ﬁnd that

x x 2 2 > d

Deﬁne B1 := B x, q

d m r(s) where r(s) = 1 + 2 max log(1/s)

serve that r(s)2 1 + 2 q

d + 2log(1/s)

d and consequently the bound (33) implies that Π (B1) = Px Π [x B1] 1 s. Now applying triangle inequality, we obtain that

x , x x 2 +

From Theorem 1 by Durmus et al. (2019), we have that Ex Π x x 2 2 d/m. Using Jensen inequality twice, we ﬁnd that

x x 2 = Ex Π [x] x 2 Ex Π x x 2 q

Ex Π x x 2 2

Noting the relation r(s) = 1 + r(s), we thus obtain that x x 2 + q

d m r(s) r(s) q

d m and consequently B1 B2 Rs. As a result, we obtain Π (Rs) Π (B1) 1 s as claimed.

Metropolis-Hastings algorithms are fast

5.4. Proof of Lemma 6

The proof of this lemma is based on the ideas employed in prior works to establish conductance bounds, ﬁrst for Hit-and-run (Lov asz, 1999), and since then for several other random walks (Lov asz and Vempala, 2007; Narayanan, 2016; Chen et al., 2018). See the survey by Vempala (2005) for further details. For our setting, a key ingredient is the following isoperimetric inequality for log-concave distributions. Let Rd = S1 S2 S3 be a partition. Let y N(0, σ2Id) with density e and let Π be a distribution with a density given by q e where q is a log-concave function. Then Cousins and Vempala (2014) (Theorem 4.4) proved that

Π(S3) log 2 d(S1, S2)

σ Π(S1)Π(S2) (35)

where d(S1, S2) := inf x y 2 x S1, y S2 . We invoke this result for the truncated distribution ΠK with the density πK deﬁned as

πK(x) := 1 Z

K π(y)dy π(x)1K(x) = 1 Z

K e f(y)dy e f(x)1K(x), (36)

where 1K( ) denotes the indicator function for the set K, i.e., we have 1K(x) = 1 if x K, and 0 otherwise. Let x = arg max π(x) = arg min f(x). Observe that m-strong-convexity of f implies that x 7 f(x) m

2 x x 2 2 is a convex function (Lemma 8(b)). Noting that the function 1K( ) is log-concave and that log-concavity is closed under multiplication, we conclude that πK can be expressed as a product of log-concave function and density of the Gaussian distribution N x , 1

m Id . Consequently, we can apply the result (35) with Π replaced by ΠK and σ = 1/ m. We now prove the claim of the lemma. Deﬁne the sets

A 1 := n u A1 K | Tu(A2) < ρ

o , A 2 := n v A2 K | Tv(A1) < ρ

along with the complement A 3 := K\(A 1 A 2). See Figure 11 for an illustration. Based on these three sets, we split our proof of the claim (24) into two distinct cases:

Case 1: Π(A 1) Π(A1 K)/2 or Π(A 2) Π(A2 K)/2.

Case 2: Π(A i) Π(Ai K)/2 for i = 1, 2.

Note that these cases are mutually exclusive, and cover all possibilities.

Case 1 We have Π(A1 K\A 1) Π(A1 K)/2, then Z

A1 Tu(A2)π(u)du (i) Z

A1 K\A 1 Tu(A2)π(u)du (ii) ρ

2Π(A1 K\A 1)

which implies the claim (24). In the above sequence of inequalities, step (i) is trivially true; step (ii) from the deﬁnition (37) of the set A 1, and step (iii) from the assumption for this case. A similar argument with the roles of A1 and A2 switched, establishes the claim when Π(A 2) Π(A2 K)/2.

Dwivedi, Chen, Wainwright and Yu

Figure 11. The sets A1 and A2 form a partition of Rd, and we use K to denote a compact convex subset. The sets A 1 and A 2 are deﬁned in equation (37).

Case 2 We have Π(A i) Π(Ai K)/2 for both i = 1 and 2. For any u A 1 and v A 2, we have that

Tu Tv TV Tu(A1) Tv(A1) (i) = 1 Tu(A2) Tv(A1) > 1 ρ,

where step (i) follows from the fact that A1 = Rd\A2 and thereby Tu(A1) = 1 Tu(A2). Since u, v K, the assumption of the lemma implies that u v 2 and consequently

d(A 1, A 2) . (38)

We claim that Z

A1 Tu(A2)π(u)du = Z

A2 Tv(A1)π(v)dv (39)

We provide the proof of this claim at the end. Assuming this claim as given, we now complete the proof. Using equation (39), we have

A1 Tu(A2)π(u)du = 1

A1 Tu(A2)π(u)du + Z

A2 Tv(A1)π(v)dv

A1 K\A 1 Tu(A2)π(u)du + Z

A2 K\A 2 Tv(A1)π(v)dv

8Π(K\(A 1 A 2)), (40)

Metropolis-Hastings algorithms are fast

where step (i) follows from the deﬁnition (37) of the set A 3 = K\(A 1 A 2). Further, we have

Π(K\(A 1 A 2)) (i) = Π(K) ΠK(K\A 1\A 2)

(ii) Π(K) log 2 d(A 1, A 2) 1/ m ΠK(A 1) ΠK(A 2)

(iii) Π(K) log 2 d(A 1, A 2) m Π(A 1) Π(A 2)

(iv) Π(K) log 2 m 1

4 Π(A1 K) Π(A2 K). (41)

where step (i) follows from the deﬁnition (36) of the truncated distribution ΠK, step (ii) follows from applying the isoperimetry (35) for the distribution ΠK with σ = 1/ m, step (iii) from the deﬁnition of ΠK and step (iv) from inequality (38) and the assumption for this case. Let α := Π(A1 K)/Π(K). Note that α [0, 1] and Π(A2 K)/Π(K) = 1 α. We have

Π(A1 K) Π(A2 K) = Π2(K) α(1 α)

2 min {α, 1 α}

2 min {Π(A1 K), Π(A2 K)} (42)

Putting the inequalities (40), (41) and (42) together, establishes the claim (24) of the lemma for this case. We now prove our earlier claim (39). Note that it suﬃces to prove that Z

A1 Tu(A2)π(u)du = Z

A2 Tv(A1)π(v)dv.

A2 Tu(A1)π(u)du (i) = Z

Rd Tu(A1)π(u)du Z

A1 Tu(A1)π(u)du

(ii) = Π(A1) Z

A1 Tu(A1)π(u)du

A1 π(u)du Z

A1 Tu(A1)π(u)du

A1 Tu(A2)π(u)du,

where steps (i) and (iii) (respectively) follow from the fact that A1 = Rd\A2 and the consequent fact that 1 Tu(A1) = Tu(A2), and step (ii) follows from the fact that π is the stationary density for the transition distribution Tx and thereby R

Rd Tu(A1)π(u)du = Π(A1).

5.5. Proof of Lemma 7

We prove each claim of the lemma separately. To simplify notation, we drop the superscript from our notations of distributions T MALA(h) x and P MALA(h) x .

Dwivedi, Chen, Wainwright and Yu

5.5.1. Proof of claim (27a)

In order to bound the total variation distance Px Py TV, we apply Pinsker s inequality (Cover and Thomas, 1991), which guarantees that Px Py TV p

2 KL(Px Py). Given multivariate normal distributions G1 = N (µ1, Σ) and G2 = N (µ2, Σ), the Kullback-Leibler divergence between the two is given by

KL(G1 G2) = 1

2 (µ1 µ2) Σ 1 (µ1 µ2) . (43)

Substituting G1 = Px and G2 = Py into the above expression and applying Pinsker s inequality, we ﬁnd that

2 KL(Px Py) = µx µy 2

(i) = (x h f(x)) (y h f(y)) 2

where step (i) follows from the deﬁnition (26) of the mean µx. Consequently, in order to establish the claim (27a), it suﬃces to show that

(x h f(x)) (y h f(y)) 2 x y 2 .

Recalling that |||B|||op denotes the ℓ2-operator norm of a matrix B (equal to the maximum singular value), we have

(x h f(x)) (y h f(y)) 2 =

I h 2f(x + t(x y)) (x y)dt 2

I h 2f(x + t(x y)) (x y) 2 dt

(i) sup z Rd |||Id h 2f(z)|||op x y 2 ,

where step (i) follows from the deﬁnition of the operator norm. Lemma 8(f) and Lemma 9(f) guarantee that the Hessian is sandwiched as m Id 2f(z) LId for all z Rd, where Id denotes the d-dimensional identity matrix. From this Hessian sandwich, it follows that

|||Id h 2f(x)|||op = max {|1 h L| , |1 hm|} < 1.

Putting together the pieces yields the claim.

5.5.2. Proof of claim (27b)

Let P1 be a distribution admitting a density p1 on Rd, and let P2 be a distribution which has an atom at x and admitting a density p2 on Rd\ {x}. The total variation distance between the distributions P1 and P2 is given by

P1 P2 TV = 1

P2({x}) + Z

Rd |p1(z) p2(z)| dz . (44)

Metropolis-Hastings algorithms are fast

The accept-reject step for MALA implies that

Tx({x}) = 1 Z

Rd min 1, π(z) pz(x)

px(z)dz, (45)

where px denotes the density corresponding to the proposal distribution Px = N(x h f(x), 2h Id). From this fact and the formula (44), we ﬁnd that

Px Tx TV = 1

Tx({x}) + Z

Rd px(z)dz Z

Rd min 1, π(z) pz(x)

Rd min 1, π(z) pz(x)

min 1, π(z) pz(x)

By applying Markov s inequality, we obtain

min 1, π(z) pz(x)

α P π(z) pz(x)

π(x) px(z) α for all α (0, 1]. (47)

We now derive a high probability lower bound for the ratio [π(z)pz(x)] / [π(x)px(z)]. Noting that π(x) exp( f(x)) and px(z) exp x h f(x) z 2 2 /(4h) , we have

π(z) pz(x) π(x) px(z) = exp f(z) x z+h f(z) 2 2 4h

exp f(x) z x+h f(x) 2 2 4h

= exp 4h(f(x) f(z)) + z x + h f(x) 2 2 x z + h f(z) 2 2 4h

Keeping track of the numerator of this exponent, we ﬁnd that

4h(f(x) f(z)) + z x + h f(x) 2 2 x z + h f(z) 2 2 = 4h(f(x) f(z)) + z x 2 2 + h f(x) 2 2 + 2h(z x) f(x)

x z 2 2 h f(z) 2 2 2h(x z) f(z)

= 2h (f(x) f(z) (x z) f(x)) | {z } M1

+2h (f(x) f(z) (x z) f(z)) | {z } M2

+ h2 f(x) 2 2 f(z) 2 2

Now we provide lower bounds for the terms Mi, i = 1, 2, 3 deﬁned in the above display. Since f is strongly convex and smooth, applying Lemma 8(c) and Lemma 9(c) yields

2 x z 2 2 , and M2 m

2 x z 2 2 . (50)

Dwivedi, Chen, Wainwright and Yu

In order to lower bound M3, we observe that

M3 = f(x) 2 2 f(z) 2 2 = f(x) + f(z), f(x) f(z)

(i) f(x) + f(z) 2 f(x) f(z) 2 (ii) (2 f(x) 2 + L x z 2) L x z 2 , (51)

where step (i) follows from the Cauchy-Schwarz s inequality and step (ii) from the triangle inequality and L-smoothness of the function f (cf. Lemma 9(d)). Combining the bounds (50) and (51) with equations (49) and (48), we have established that

π(z) pz(x) π(x) px(z) exp

4(L m) x z 2 2 h

2L x z 2 f(x) 2 + L2 x z 2 2

Now to provide a high probability lower bound for the term T, we make use of the standard chi-squared tail bounds and the following relation between x and z:

z (d) = x h f(x) +

where ξ N(0, Id) and (d) = denotes equality in distribution. We have

x z 2 = h f(x) +

2hξ 2 h f(x) 2 +

which also implies

x z 2 2 2h2 f(x) 2 2 + 4h ξ 2 2 .

Using these two inequalities, we ﬁnd that

2 f(x) 2 2 (L m)h ξ 2 2 Lh2

2 f(x) 2 ξ 2 L2h3

2 f(x) 2 2 L2h2 ξ 2 2 .

Simplifying and using the fact that Lh 1, we obtain that

T 2 Lh2 f(x) 2 2 + Lh ξ 2 2 + Lh

h f(x) 2 ξ 2 .

Since x Rs, we have

f(x) 2 = f(x) f(x ) 2

(i) L x x 2 L

d mr(s) =: Ds, (53)

Metropolis-Hastings algorithms are fast

where inequality (i) follows from the property (d) of Lemma 9. Thus, we have shown that

T 2 Lh2D2 s + Lh ξ 2 2 + Lh

h Ds ξ 2 . (54)

Standard tail bounds for χ2-variables guarantee that

P h ξ 2 2 dαϵ i (1 ϵ/16) for αϵ = 1 + 2 p

log(16/ϵ) + 2 log(16/ϵ).

A simple observation reveals that the function w deﬁned in equation (25a) was chosen such that for any h w(s, ϵ), we have

Lh2D2 s ϵ 128, Lhdαϵ ϵ

64, and, Lh

Combining this observation with the high probability bound on ξ 2 and using the inequality (54) we obtain that T ϵ/16 with probability at least 1 ϵ/16. Plugging this bound in the inequality (52), we ﬁnd that

P π(z) pz(x)

π(x) px(z) exp ϵ

Thus, we have derived a desirable high probability lower bound on the accept-reject ratio. Substituting α = exp( ϵ/16) in the inequality (47) and using the fact that e ϵ/16 1 ϵ/16 for any scalar ϵ > 0, we ﬁnd that

min 1, π(z) pz(x)

8, for any ϵ (0, 1) and h w(s, ϵ).

Substituting this bound in the inequality (46) completes the proof.

5.6. Proof of Theorem 2

The proof of this theorem is similar to the proof of Theorem 1. We begin by claiming that

PMRW(h) x PMRW(h) y TV = ϵ

2 for all x, y such that x y 2 ϵ

PMRW(h) x T MRW(h) x TV = ϵ

8 for all x Rs, (55b)

for any h cϵ2m/(αϵd L2r(s)) for some universal constant c. Plugging s = δ/(2β), ϵ = 1/2 and arguing as in Section 5.2, we ﬁnd that Φ MRW(h) δ/2β c

mh for some universal constant c . Using the convergence rate (22), we obtain that

T k MRW(h)(µ0) Π TV β δ

2β + βe kmh/c δ for all k c

for a suitably large constant c . Substituting h cm/ d L2r(δ/2β) , yields the claimed bound on mixing time of MRW.

Dwivedi, Chen, Wainwright and Yu

It is now left to establish our earlier claims (55a) and (55b). Note that P MRW(h) x = N(x, 2h Id). For brevity, we drop the superscripts from our notations. Using the expression (43) for the KL-divergence and applying Pinsker s inequality leads to the upper bound

2 KL(Px Py) = x y 2

which implies the claim (55a). We now prove the bound (55b). Letting px to denote the density of the proposal distribution Px and using the bounds (46) and (47), it suﬃces to prove that

(i) = Pz Px h f(x) f(z) ϵ

i (1 ϵ/16), (57)

where step (i) follows from the fact that π(x) e f(x). We have

f(x) f(z) (i) f(z) (x z) = ( f(z) f(x)) (x z) + f(x) (x z)

(ii) L x z 2 2 + f(x) (x z)

= 2Lh ξ 2 2 +

2h f(x) ξ (58)

where the step (i) follows from the convexity of the function f, step (ii) the smoothness of the function f (Lemma 9(e)). Note that the random variable χ := f(x) ξ N(0, f(x) 2 2) and that f(x) 2 Ds for any x Rs. Consequently, we have χ Ds 2 p

log(32/ϵ) with probability at least 1 ϵ/32. On the other hand, using the standard tail bound for a Chisquared random variable, we obtain that P h ξ 2 2 dαϵ i ϵ/32 for αϵ = 1+2 p

2 log(32/ϵ). Recalling that Ds = L q

d mr(s) and doing straightforward calculation reveals

that for h ϵ2

(8192αϵd L2

m r(s)), we have

log(32/ϵ) 3ϵ

Combining these bounds with the high probability statements above and plugging in the inequality (58), we ﬁnd that f(x) f(z) ϵ/16 with probability at least 1 ϵ/16, which yields the claim (57).

6. Discussion

In this paper, we derived non-asymptotic bounds on the mixing time of the Metropolis adjusted Langevin algorithm and Metropolized random walk for log-concave distributions. These algorithms are based on a two-phase scheme: (1) a proposal step followed by (2) an accept-reject step. Our results show that the accept-reject step while leading to signiﬁcant complications in the analysis is practically very useful: algorithms applying this step mix signiﬁcantly faster than the ones without it. In particular, we showed that for a strongly log-concave distribution in Rd with condition number κ, the δ-mixing time for MALA is

Metropolis-Hastings algorithms are fast

of O (dκ log(1/δ)). This guarantee signiﬁcantly better than the O dκ2/δ2 mixing time for ULA established in the literature. We also proposed a modiﬁed version of MALA to sample from non-strongly log-concave distributions and showed that it mixes in O d2/δ1.5 ; thus, this algorithm dependency on the desired accuracy δ when compared to the O d3/δ4

mixing time for ULA for the same task. Furthermore, we established O dκ2 log(1/δ)

mixing time bound for the Metropolized random walk for log-concave sampling.

Several fundamental questions arise from our work. All of our results are upper bounds on mixing time, and our simulation results suggest that they are tight for the choice of step size used in the Theorem statements. However, simulations from Section 4.4 suggest that warmness parameter β should not aﬀect the choice of step size too much and hence potentially larger choices of step sizes (and thereby faster mixing) are possible. To this end, in a recent pre-print (Chen et al., 2019), we established faster mixing time bounds for MALA and MRW from a non-warm start where we show that the dependence on warmness can be improved from log β to log log β. Moreover, for a deterministic start, one may consider running ULA for a few steps run to obtain moderate accuracy, and then run MALA initialized with the ULA iterates (thereby providing a warm start to MALA). In practice, we ﬁnd that this hybrid procedure can generate highly accurate samples in reasonably few number of iterations.

Another open question is to sharply delineate the fundamental gap between the mixing times of ﬁrst-order sampling methods and that of zeroth-order sampling methods. Noting that MALA is a ﬁrst-order method while MRW is a zeroth-order method, from our work, we obtain that two class of methods diﬀer in a factor of the condition number κ of the target distribution. It is an interesting question to determine whether this represents a sharp gap between these two classes of sampling methods.

The current state-of-the-art algorithm, namely Hamiltonian Monte Carlo (Neal, 2011) can be seen as a multi-step generalization of MALA. Instead of centering the proposal after one gradient-step, HMC simulates an ODE in an augmented space for a few time steps. Indeed, MALA is equivalent to a particular one-step discretization of the ODE associated with HMC. Nonetheless, the practically used HMC makes use of multi-step discretization and is more involved than MALA. Empirically HMC has proven to have superior mixing times for a broad class of distributions (and not just log-concave distributions). A line of recent work (Bou-Rabee et al., 2018; Mangoubi and Smith, 2017; Mangoubi and Vishnoi, 2018) provides theoretical guarantees for HMC in diﬀerent settings. Several of these works analyze an idealized version of HMC or the discretized version without the accept-reject step. In a recent pre-print (Chen et al., 2019), we have provided some theoretical guarantees on the convergence of Metropolized HMC, which is the most practical version of HMC.

Acknowledgements

This work was supported by Oﬃce of Naval Research grant DOD ONR-N00014 to MJW, and by ARO W911NF1710005, NSF-DMS 1613002 and the Center for Science of Information (CSo I), US NSF Science and Technology Center, under grant agreement CCF-0939370 to BY. In addition, MJW was partially supported by National Science Foundation grant NSFDMS-1612948, and RD was partially supported by the Berkeley Fellowship.

Dwivedi, Chen, Wainwright and Yu

Appendix A. Some basic properties

In this appendix, we state a few basic properties of strongly-convex and smooth functions that we use in our proofs. See the book (Boyd and Vandenberghe, 2004) for more details.

Lemma 8 (Equivalent characterizations of strong convexity) For a twice diﬀerentiable convex function f : Rd 7 R, the following statements are equivalent:

(a) The function f is m-strongly-convex.

(b) The function x 7 f(x) m

2 x x 2 2 is convex (for any ﬁxed point x ).

(c) For any x, y Rd, we have

f(y) f(x) + f(x) (y x) + m

2 x y 2 2 .

(d) For any x, y Rd, we have

f(x) f(y) 2 m x y 2 .

(e) For any x, y Rd, we have

( f(x) f(y)) (x y) m x y 2 2 .

(f) For any x Rd, the Hessian is lower bounded as 2f(x) m Id.

Lemma 9 (Equivalent characterizations of smoothness) For a twice diﬀerentiable convex function f : Rd 7 R, the following statements are equivalent:

(a) The function f is L-smooth.

(b) The function x 7 L

2 x x 2 2 f(x) is convex (for any ﬁxed point x ).

(c) For any x, y Rd, we have

f(y) f(x) + f(x) (y x) + L

2 x y 2 2 .

(d) For any x, y Rd, we have

f(x) f(y) 2 L x y 2 .

(e) For any x, y Rd, we have

( f(x) f(y)) (x y) L x y 2 2 .

(f) For any x Rd, the Hessian is upper bounded as 2f(x) LId.

Metropolis-Hastings algorithms are fast

Claude JP B elisle, H Edwin Romeijn, and Robert L Smith. Hit-and-run algorithms for generating multivariate distributions. Mathematics of Operations Research, 18(2):255 266, 1993.

Nawaf Bou-Rabee and Martin Hairer. Nonasymptotic mixing of the MALA algorithm. IMA Journal of Numerical Analysis, 33(1):80 110, 2012.

Nawaf Bou-Rabee, Andreas Eberle, and Raphael Zimmer. Coupling and convergence for Hamiltonian Monte Carlo. ar Xiv preprint ar Xiv:1805.00452, 2018.

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Steve Brooks, Andrew Gelman, Galin L Jones, and Xiao-Li Meng. Handbook of Markov Chain Monte Carlo. Chapman and Hall/CRC, 2011.

S ebastien Bubeck. Convex optimization: algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231 357, 2015.

S ebastien Bubeck, Ronen Eldan, and Joseph Lehec. Sampling from a log-concave distribution with projected Langevin Monte Carlo. Discrete & Computational Geometry, 59(4): 757 783, 2018.

Yuansi Chen, Raaz Dwivedi, Martin J Wainwright, and Bin Yu. Fast MCMC sampling algorithms on polytopes. The Journal of Machine Learning Research, 19(1):2146 2231, 2018.

Yuansi Chen, Raaz Dwivedi, Martin J. Wainwright, and Bin Yu. Fast mixing of Metropolized Hamiltonian Monte Carlo: Beneﬁts of multi-step gradients. ar Xiv preprint ar Xiv:1905.12247, 2019.

Xiang Cheng and Peter L Bartlett. Convergence of Langevin MCMC in KL-divergence. PMLR 83, (83):186 211, 2018.

Xiang Cheng, Niladri S Chatterji, Peter L Bartlett, and Michael I Jordan. Underdamped Langevin MCMC: A non-asymptotic analysis. In Conference On Learning Theory, pages 300 323, 2018.

Ben Cousins and Santosh Vempala. A cubic algorithm for computing Gaussian volume. In Proceedings of the twenty-ﬁfth annual ACM-SIAM symposium on discrete algorithms, pages 1215 1228. Society for Industrial and Applied Mathematics, 2014.

T.M. Cover and J.A. Thomas. Elements of Information Theory. John Wiley and Sons, New York, 1991.

Arnak S Dalalyan. Theoretical guarantees for approximate sampling from smooth and logconcave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2016.

Dwivedi, Chen, Wainwright and Yu

Persi Diaconis and David Freedman. On Markov chains with continuous state space. Technical report, 1997.

Alain Durmus, Eric Moulines, and Marcelo Pereyra. Eﬃcient Bayesian computation by proximal Markov chain Monte Carlo: when Langevin meets Moreau. SIAM Journal on Imaging Sciences, 11(1):473 506, 2018.

Alain Durmus, Eric Moulines, et al. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli, 25(4A):2854 2882, 2019.

Martin Dyer, Alan Frieze, and Ravi Kannan. A random polynomial-time algorithm for approximating the volume of convex bodies. Journal of the ACM (JACM), 38(1):1 17, 1991.

Andreas Eberle. Error bounds for Metropolis Hastings algorithms applied to perturbations of Gaussian measures in high dimensions. The Annals of Applied Probability, 24(1):337 377, 2014.

Andreas Eberle, Arnaud Guillin, Raphael Zimmer, et al. Couplings and quantitative contraction rates for langevin dynamics. The Annals of Probability, 47(4):1982 2010, 2019.

Alan Frieze, Ravi Kannan, and Nick Polson. Sampling from log-concave distributions. The Annals of Applied Probability, pages 812 837, 1994.

Ulf Grenander and Michael I Miller. Representations of knowledge in complex systems. Journal of the Royal Statistical Society. Series B (Methodological), pages 549 603, 1994.

Gilles Harg e. A convex/log-concave correlation inequality for Gaussian measure and an application to abstract Wiener spaces. Probability theory and related ﬁelds, 130(3):415 440, 2004.

W Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97 109, 1970.

Daniel Hsu, Sham Kakade, Tong Zhang, et al. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17, 2012.

Søren Fiig Jarner and Ernst Hansen. Geometric ergodicity of Metropolis algorithms. Stochastic processes and their applications, 85(2):341 361, 2000.

Ravi Kannan, L aszl o Lov asz, and Mikl os Simonovits. Isoperimetric problems for convex bodies and a localization lemma. Discrete & Computational Geometry, 13(1):541 559, 1995.

L aszl o Lov asz. Hit-and-run mixes fast. Mathematical Programming, 86(3):443 461, 1999.

L aszl o Lov asz and Mikl os Simonovits. The mixing rate of Markov chains, an isoperimetric inequality, and computing the volume. In Proceedings of 31st Annual Symposium on Foundations of Computer Science, 1990, pages 346 354. IEEE, 1990.

Metropolis-Hastings algorithms are fast

L aszl o Lov asz and Mikl os Simonovits. Random walks in a convex body and an improved volume algorithm. Random Structures & Algorithms, 4(4):359 412, 1993.

L aszl o Lov asz and Santosh Vempala. Hit-and-run from a corner. SIAM Journal on Computing, 35(4):985 1005, 2006.

L aszl o Lov asz and Santosh Vempala. The geometry of logconcave functions and sampling algorithms. Random Structures & Algorithms, 30(3):307 358, 2007.

Oren Mangoubi and Aaron Smith. Rapid mixing of Hamiltonian Monte Carlo on strongly log-concave distributions. ar Xiv preprint ar Xiv:1708.07114, 2017.

Oren Mangoubi and Nisheeth Vishnoi. Dimensionally tight bounds for second-order Hamiltonian Monte Carlo. In Advances in Neural Information Processing Systems, pages 6027 6037, 2018.

Kerrie L Mengersen, Richard L Tweedie, et al. Rates of convergence of the Hastings and Metropolis algorithms. The Annals of Statistics, 24(1):101 121, 1996.

Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087 1092, 1953.

Sean P Meyn and Richard L Tweedie. Markov chains and stochastic stability. Springer Science & Business Media, 2012.

Sean P Meyn and Robert L Tweedie. Computable bounds for geometric convergence rates of Markov chains. The Annals of Applied Probability, pages 981 1011, 1994.

Hariharan Narayanan. Randomized interior point methods for sampling and optimization. The Annals of Applied Probability, 26(1):597 641, 2016.

Radford M Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11), 2011.

G Parisi. Correlation functions and computer simulations. Nuclear Physics B, 180(3): 378 384, 1981.

Marcelo Pereyra. Proximal Markov chain Monte Carlo algorithms. Statistics and Computing, 26(4):745 760, 2016.

Natesh S Pillai, Andrew M Stuart, and Alexandre H Thi ery. Optimal scaling and diﬀusion limits for the Langevin algorithm in high dimensions. The Annals of Applied Probability, 22(6):2320 2356, 2012.

Christian P Robert. Monte Carlo methods. Wiley Online Library, 2004.

Gareth O Roberts and Jeﬀrey S Rosenthal. Optimal scaling for various Metropolis-Hastings algorithms. Statistical Science, 16(4):351 367, 2001.

Dwivedi, Chen, Wainwright and Yu

Gareth O Roberts and Jeﬀrey S Rosenthal. Complexity bounds for MCMC via diﬀusion limits. ar Xiv preprint ar Xiv:1411.0712, 2014.

Gareth O Roberts and Osnat Stramer. Langevin diﬀusions and Metropolis-Hastings algorithms. Methodology and computing in applied probability, 4(4):337 357, 2002.

Gareth O Roberts and Richard L Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pages 341 363, 1996a.

Gareth O Roberts and Richard L Tweedie. Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms. Biometrika, 83(1):95 110, 1996b.

Gareth O Roberts, Jeﬀrey S Rosenthal, et al. General state space Markov chains and MCMC algorithms. Probability Surveys, 1:20 71, 2004.

Denis Talay and Luciano Tubaro. Expansion of the global error for numerical schemes solving stochastic diﬀerential equations. Stochastic analysis and applications, 8(4):483 509, 1990.

Santosh Vempala. Geometric random walks: a survey. Combinatorial and Computational Geometry, 52(573-612):2, 2005.

Tatiana Xifara, Chris Sherlock, Samuel Livingstone, Simon Byrne, and Mark Girolami. Langevin diﬀusions and the Metropolis-adjusted Langevin algorithm. Statistics & Probability Letters, 91:14 19, 2014.