Published as a conference paper at ICLR 2022

SQRT(D) DIMENSION DEPENDENCE OF LANGEVIN MONTE CARLO

Ruilin Li (Georgia Institute of Technology, ruilin.li@gatech.edu), Hongyuan Zha (School of Data Science, Shenzhen Institute of Artificial Intelligence and Robotics for Society, The Chinese University of Hong Kong, Shenzhen, zhahy@cuhk.edu.cn), Molei Tao (Georgia Institute of Technology, mtao@gatech.edu)

ABSTRACT

This article considers the popular MCMC method of unadjusted Langevin Monte Carlo (LMC) and provides a non-asymptotic analysis of its sampling error in 2-Wasserstein distance. The proof is based on a refinement of the mean-square analysis in Li et al. (2019), and this refined framework automates the analysis of a large class of sampling algorithms based on discretizations of contractive SDEs. Using this framework, we establish an Õ(√d/ϵ) mixing time bound for LMC, without warm start, under the common log-smooth and log-strongly-convex conditions, plus a growth condition on the 3rd-order derivative of the potential of target measures. This bound improves the best previously known Õ(d/ϵ) result and is optimal (in terms of order) in both dimension d and accuracy tolerance ϵ for target measures satisfying the aforementioned assumptions. Our theoretical analysis is further validated by numerical experiments.

1 INTRODUCTION

The problem of sampling statistical distributions has attracted considerable attention, not only in the fields of statistics and scientific computing, but also in machine learning (Robert & Casella, 2013; Andrieu et al., 2003; Liu, 2008); for example, how various sampling algorithms scale with the dimension of the target distribution is a popular recent topic in statistical deep learning (see discussions below for references).
For samplers that can be viewed as discretizations of SDEs, the idea is to use an ergodic SDE whose equilibrium distribution agrees with the target distribution, and to employ an appropriate numerical algorithm that discretizes (the time of) the SDE. The iterates of the numerical algorithm will approximately follow the target distribution once converged, and can be used for various downstream applications such as Bayesian inference and inverse problems (Dashti & Stuart, 2017). One notable example is the Langevin Monte Carlo algorithm (LMC), which corresponds to an Euler-Maruyama discretization of the overdamped Langevin equation. Its study dates back to at least the 1990s (Roberts et al., 1996) but keeps leading to important discoveries, for example, on non-asymptotics and dimension dependence, which are relevant to machine learning (e.g., Dalalyan (2017a;b); Cheng et al. (2018a); Durmus et al. (2019); Durmus & Moulines (2019); Vempala & Wibisono (2019); Dalalyan & Riou-Durand (2020); Li et al. (2019); Erdogdu & Hosseinzadeh (2021); Mou et al. (2019); Lehec (2021)). LMC is closely related to SGD too (e.g., Mandt et al. (2017)). Many other examples exist, based on alternative SDEs and/or different discretizations (e.g., Dalalyan & Riou-Durand (2020); Ma et al. (2021); Mou et al. (2021); Li et al. (2020); Roberts & Rosenthal (1998); Chewi et al. (2021); Shen & Lee (2019)). Quantitatively characterizing the non-asymptotic sampling error of numerical algorithms is usually critical for choosing the appropriate algorithm for a specific downstream application, for providing practical guidance on hyperparameter selection and experiment design, and for designing improved samplers.
A powerful tool that dates back to Jordan et al. (1998) is a paradigm of non-asymptotic error analysis, namely to view sampling as optimization in probability space, and it has led to many important recent results (e.g., Liu & Wang (2016); Dalalyan (2017a); Wibisono (2018); Zhang et al. (2018); Frogner & Poggio (2020); Chizat & Bach (2018); Chen et al. (2018); Ma et al. (2021); Erdogdu & Hosseinzadeh (2021)). It works by choosing an objective functional, typically some statistical distance/divergence, and showing that the law of the iterates of the sampling algorithm converges in that objective functional. However, the choice of the objective functional often needs to be customized for different sampling algorithms. For example, KL divergence works for LMC (Cheng & Bartlett, 2018), but a carefully hand-crafted cross term needs to be added to KL divergence for analyzing KLMC (Ma et al., 2021). Even for the same underlying SDE, different discretization schemes exist and lead to different sampling algorithms, and their analyses have usually been conducted case by case (e.g., Cheng et al. (2018b); Dalalyan & Riou-Durand (2020); Shen & Lee (2019)). Therefore, it would be a desirable complement to have a unified, general framework for studying the non-asymptotic error of SDE-based sampling algorithms. Toward this goal, an alternative approach to analysis has recently started attracting attention, namely to resort to the numerical analysis of SDE integrators (e.g., Milstein & Tretyakov (2013); Kloeden & Platen (1992)) and quantitatively connect the integration error to the sampling error. One remarkable work in this direction is Li et al. (2019), which will be discussed in greater detail later on. The main tool of analysis in this paper will be a strengthened version (in specific aspects that will be clarified soon) of the result in Li et al. (2019).
Although this analysis framework is rather general and applicable to a broad family of numerical methods that discretize contractive¹ SDEs, the main innovation focuses on a specific sampling algorithm, namely LMC, which is widely used in practice. Its stochastic gradient version is implemented in common machine learning systems, such as TensorFlow (Abadi et al., 2016), and is the off-the-shelf algorithm for large-scale Bayesian inference. With the ever-growing size of parameter spaces, the non-asymptotic error of LMC is of central theoretical and practical interest, in particular its dependence on the dimension of the sample space. The best currently known upper bound on the mixing time of LMC in 2-Wasserstein distance is Õ(d/ϵ) (Durmus & Moulines, 2019). Motivated by a recent result (Chewi et al., 2021) that shows better dimension dependence for a Metropolis-adjusted improvement of LMC, we investigate whether the current bound for (unadjusted) LMC is tight, and if not, what the optimal dimension dependence is.

Our contributions. The main contribution of this work is an improved Õ(√d/ϵ) mixing time upper bound for LMC in 2-Wasserstein distance, under reasonable regularity assumptions. More specifically, we study LMC for sampling from a Gibbs distribution dµ ∝ exp(−f(x)) dx. Under the standard smoothness and strong-convexity assumptions, plus an additional linear growth condition on the third-order derivative of the potential (which also shares connections to popular assumptions in the frontier literature), our bound improves upon the previously best known Õ(d/ϵ) result (Durmus & Moulines, 2019) in terms of dimension dependence.
For comparison, note it was known that discretized kinetic Langevin dynamics can lead to √d dependence on dimension (Cheng & Bartlett, 2018; Dalalyan & Riou-Durand, 2020), and some believe that it is the introduction of momentum that improves the dimension dependence; our result, however, shows that discretized overdamped Langevin (no momentum) can also have a mixing time scaling like √d. In fact, it is important to mention that it was recently shown that the Metropolis-adjusted Euler-Maruyama discretization of overdamped Langevin (i.e., MALA) has an optimal dimension dependence of Õ(√d) (Chewi et al., 2021), while what we analyze here is the unadjusted version (i.e., LMC), and it has the same dimension dependence (note however that our ϵ dependence is not as good as that for MALA; more discussion in Section 4). We also construct an example which shows that the mixing time of LMC is at least Ω̃(√d/ϵ). Hence, our mixing time bound has the optimal dependence on both d and ϵ, in terms of order, for the family of target measures satisfying those regularity assumptions. Our theoretical analysis is further validated by empirical investigation of numerical examples.

A minor contribution of this work is the error analysis framework that we use. It is based on the classical mean-square analysis (Milstein & Tretyakov, 2013) in the numerical SDE literature, extended from finite time to infinite time. It is a minor contribution because this extension was already pioneered in the milestone work of Li et al. (2019), although we will develop a refined version. As in classical mean-square analysis and in Li et al. (2019), the final (sampling, in this case) error is only half an order lower than the order p2 of the local strong integration error.

¹ possibly after a coordinate transformation
This leads to an Õ((C/ϵ)^{1/(p2−1/2)}) mixing time upper bound in 2-Wasserstein distance for this family of algorithms, where C is a constant containing various information about the underlying problem, e.g., the dimension d. Nevertheless, the following two aspects are new to this paper: (i) We weakened the requirement on local strong and weak errors. More precisely, Li et al. (2019) requires uniform bounds on local errors, but this can be a nontrivial requirement for SDE integrators; the improvement here only requires non-uniform bounds (although establishing the same result consequently takes notably more effort; these efforts are included in this paper too). (ii) The detailed expressions of our bounds are not the same as those in Li et al. (2019) (even when local errors can be uniformly bounded), and as we are interested in the dimension dependence of LMC, we work out constants and carefully track their dimension dependence. Bounds and constants in Li et al. (2019) were not specifically designed for tightly tracking dimension dependence, as the focus of that seminal paper was more on ϵ dependence; consequently, its general error bound only led to an O(d) dependence in mixing time when applied to LMC (see Example 1 in Li et al. (2019)), whereas our result leads to O(√d).

2 PRELIMINARIES

Notation. Use boldface x to denote a d-dimensional vector and plain x to denote a scalar variable. ‖x‖ denotes the Euclidean vector norm. A numerical algorithm is denoted by A and its k-th iterate by x̄k. We slightly abuse notation by identifying measures with their density functions w.r.t. the Lebesgue measure. Use the conventions Õ(·) = O(·) log^{O(1)}(·) and Ω̃(·) = Ω(·) log^{O(1)}(·), i.e., the Õ(·)/Ω̃(·) notation ignores logarithmic factors. Denote the 2-Wasserstein distance by W2(µ1, µ2) = inf_{(X,Y)∈Π(µ1,µ2)} (E‖X − Y‖²)^{1/2}, where Π(µ1, µ2) is the set of couplings, i.e., all joint measures whose X and Y marginals are µ1 and µ2.
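As a concrete instance of this definition, between isotropic Gaussians the infimum over couplings can be evaluated in closed form (the optimal coupling is a linear map matching means and rescaling the shared noise). A minimal sketch of this well-known formula (the helper name is ours, not from the paper):

```python
import math

def w2_isotropic_gaussians(m1, s1, m2, s2):
    """Closed-form W2 between N(m1, s1^2 I) and N(m2, s2^2 I):
    W2^2 = ||m1 - m2||^2 + d * (s1 - s2)^2, with d = len(m1)."""
    d2 = sum((a - b) ** 2 for a, b in zip(m1, m2)) + len(m1) * (s1 - s2) ** 2
    return math.sqrt(d2)

# e.g., N(0, I) vs N(0, 4I) in d = 9 dimensions: W2 = sqrt(9 * (1 - 2)^2) = 3
print(w2_isotropic_gaussians([0.0] * 9, 1.0, [0.0] * 9, 2.0))
```

This closed form is also what the Gaussian lower-bound example in Section 4 reduces to.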
Denote the target distribution by µ and the law of a random variable X by Law(X). Finally, denote the mixing time of a sampling algorithm A converging to its target distribution µ in 2-Wasserstein distance by τmix(ϵ; W2; A) = inf{k ≥ 0 : W2(Law(x̄k), µ) ≤ ϵ}.

SDE for Sampling. Consider a general SDE

dxt = b(t, xt)dt + σ(t, xt)dBt,  (1)

where b ∈ R^d is a drift term, σ ∈ R^{d×l} is a diffusion coefficient matrix and Bt is an l-dimensional Wiener process. Under mild conditions (Pavliotis, 2014, Theorem 3.1), there exists a unique strong solution xt to Eq. (1). Some SDEs admit geometric ergodicity, so that their solutions converge exponentially fast to a unique invariant distribution; examples include the classical overdamped and kinetic Langevin dynamics, but are not limited to those (e.g., Mou et al. (2021); Li et al. (2020)). Such SDEs are desirable for sampling purposes, because one can set the target distribution to be the invariant distribution by choosing an SDE with an appropriate potential, and then evolve the solution xt of the SDE with t pushed toward infinity, so that (approximate) samples of the target distribution are obtained. Except for a few known cases, however, explicit solutions of Eq. (1) are elusive and we have to resort to numerical schemes to simulate/integrate the SDE. Example schemes include, but are not limited to, the Euler-Maruyama method, Milstein methods and Runge-Kutta methods (e.g., Kloeden & Platen (1992); Milstein & Tretyakov (2013)). With constant step size h, at the k-th iteration a typical numerical algorithm takes the previous iterate x̄k−1 and outputs a new iterate x̄k as an approximation of the solution xt of Eq. (1) at time t = kh.

Langevin Monte Carlo Algorithm. The LMC algorithm is defined by the following update rule

x̄k = x̄k−1 − h∇f(x̄k−1) + √(2h) ξk,  k = 1, 2, ...,  (2)

where {ξk}, k ∈ Z>0, are i.i.d. standard d-dimensional Gaussian vectors.
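In code, one LMC iteration of Eq. (2) is a gradient step plus injected Gaussian noise. A minimal NumPy sketch (function name ours; the standard-Gaussian toy target is only for illustration):

```python
import numpy as np

def lmc_chain(grad_f, x0, h, n_steps, rng):
    """Unadjusted Langevin Monte Carlo, Eq. (2):
    x_k = x_{k-1} - h * grad_f(x_{k-1}) + sqrt(2h) * xi_k,
    with xi_k i.i.d. standard Gaussian.  x0 may be a batch of chains
    of shape (n_chains, d); all chains are updated in parallel."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x - h * grad_f(x) + np.sqrt(2.0 * h) * rng.standard_normal(x.shape)
    return x

# Toy target: standard Gaussian, f(x) = ||x||^2 / 2, so grad_f(x) = x.
rng = np.random.default_rng(0)
samples = lmc_chain(lambda x: x, np.zeros((50_000, 2)), h=0.05, n_steps=200, rng=rng)
```

For this quadratic potential the chain is an AR(1) process, so its stationary per-coordinate variance is exactly 2h / (1 − (1 − h)²) = 1 / (1 − h/2), slightly inflated relative to the target variance 1; this discretization bias is precisely what the analysis below quantifies.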
LMC corresponds to an Euler-Maruyama discretization of the continuous overdamped Langevin dynamics dxt = −∇f(xt)dt + √2 dBt, which converges to the equilibrium distribution µ ∝ exp(−f(x)). Dalalyan (2017b) provided a non-asymptotic analysis of LMC. An Õ(d/ϵ²) mixing time bound in W2 for log-smooth and log-strongly-convex target measures (Dalalyan, 2017a; Cheng et al., 2018a; Durmus et al., 2019) has been established. It was further improved to Õ(d/ϵ) under an additional Hessian Lipschitz condition (Durmus & Moulines, 2019). Mixing time bounds of LMC in other statistical distances/divergences have also been studied, including total variation distance (Dalalyan, 2017b; Durmus & Moulines, 2017) and KL divergence (Cheng & Bartlett, 2018).

Classical Mean-Square Analysis. A powerful framework for quantifying the global discretization error of a numerical algorithm for Eq. (1), i.e., ek = (E‖x_{kh} − x̄k‖²)^{1/2}, is mean-square analysis (e.g., Milstein & Tretyakov (2013)). Mean-square analysis studies how local integration errors propagate and accumulate into the global integration error; in particular, if the one-step (local) weak error and strong error (where the exact and numerical solutions start from the same initial value x) satisfy

‖E xh − E x̄1‖ ≤ C1 (1 + E‖x‖²)^{1/2} h^{p1},  (local weak error)
(E‖xh − x̄1‖²)^{1/2} ≤ C2 (1 + E‖x‖²)^{1/2} h^{p2},  (local strong error)  (3)

over a time interval [0, Kh] for some constants C1, C2 > 0, p2 ≥ 1/2 and p1 ≥ p2 + 1/2, then the global error is bounded by ek ≤ C (1 + E‖x0‖²)^{1/2} h^{p2−1/2}, k = 1, ..., K, for some constant C > 0 dependent on Kh. Although classical mean-square analysis is only concerned with numerical integration error, sampling error can also be inferred. However, there is a limitation that prevents its direct employment in analyzing sampling algorithms: the global error bound only holds in finite time, because the constant C can grow exponentially as K increases, rendering the bound useless when K → ∞.
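The orders p1 and p2 in (3) can be checked in closed form for Euler-Maruyama on the Ornstein-Uhlenbeck process dxt = −xt dt + √2 dBt (a sanity-check sketch of ours, not from the paper): the one-step weak error is (e^{−h} − 1 + h)|x| = O(h²), and under synchronous coupling the one-step strong error is O(h^{3/2}), i.e., p1 = 2 and p2 = 3/2, exactly the orders used for LMC later.

```python
import math

def weak_err(x, h):
    """|E x_h - E xbar_1| for OU dx = -x dt + sqrt(2) dB and one
    Euler-Maruyama step: exact mean e^{-h} x vs numerical mean (1-h) x."""
    return abs(math.exp(-h) - 1.0 + h) * abs(x)

def strong_err(x, h):
    """L2 distance between x_h and xbar_1 driven by the same Brownian path:
    deterministic part (e^{-h} - 1 + h) x plus the stochastic-integral
    mismatch 2 * int_0^h (e^{-(h-s)} - 1)^2 ds."""
    det = (math.exp(-h) - 1.0 + h) ** 2 * x * x
    stoch = 2.0 * ((1 - math.exp(-2 * h)) / 2 - 2 * (1 - math.exp(-h)) + h)
    return math.sqrt(det + stoch)

# the ratios stabilize, confirming orders p1 = 2 and p2 = 3/2
for h in (0.1, 0.05, 0.025):
    print(h, weak_err(1.0, h) / h**2, strong_err(1.0, h) / h**1.5)
```

Analytically, weak_err(1, h)/h² tends to 1/2 and strong_err(1, h)/h^{3/2} tends to √(2/3) as h decreases.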
3 MEAN-SQUARE ANALYSIS OF SAMPLERS BASED ON CONTRACTIVE SDE

We now review and present some improved results on how to use mean-square analysis of the integration error to quantify the sampling error. A seminal paper in this direction is Li et al. (2019). What is known / new will be clarified. In all cases, the first step is to lift the finite-time limitation when the SDE being discretized has some decaying property, so that local integration errors do not amplify with time. The specific type of decaying property we will work with is contractivity (possibly after a coordinate transformation). It is a sufficient condition for the underlying SDE to converge to a statistical distribution.

Definition 3.1. A stochastic differential equation is contractive if there exist a non-singular constant matrix A ∈ R^{d×d} and a constant β > 0 such that any pair of solutions of the SDE satisfies

(E‖A(xt − yt)‖²)^{1/2} ≤ (E‖A(x − y)‖²)^{1/2} exp(−βt),  (4)

where xt, yt are two solutions, driven by the same Brownian motion but evolved respectively from initial conditions x and y.

Remark. As long as b and σ in (1) do not explicitly depend on time, it suffices to find an arbitrarily small t0 > 0 and show that (4) holds for all t < t0.

Remark. Sometimes contraction is not easy to establish directly, but can be shown after an appropriate coordinate transformation; see (Dalalyan & Riou-Durand, 2020, Proposition 1) for such a treatment of kinetic Langevin dynamics. The introduction of A permits such transformations. In particular, overdamped Langevin dynamics, of which LMC is a discretization, is contractive when f is strongly convex and smooth.

We now use contractivity to remove the finite-time limitation. We first need a short-time lemma.

Lemma 3.2. (Milstein & Tretyakov, 2013, Lemma 1.3) Suppose b and σ in Eq. (1) are Lipschitz continuous. For two solutions xt, yt of Eq. (1) starting from x, y respectively, denote zt(x, y) := (xt − x) − (yt − y); then there exist C0 > 0 and h0 > 0 such that

E‖zt(x, y)‖² ≤ C0 ‖x − y‖² t,  ∀x, y, 0 < t ≤ h0.  (5)
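Contractivity in the sense of Definition 3.1 can be probed numerically: under synchronous coupling (the same Brownian increments driving both trajectories), the noise cancels in the difference of two overdamped Langevin trajectories, so for an m-strongly-convex potential the difference contracts deterministically at rate m, i.e., Eq. (4) with A = I. A small sketch (the test potential is our own example, not from the paper):

```python
import numpy as np

def grad_f(x):
    # hypothetical potential f(x) = ||x||^2/2 + sum_i log cosh(x_i):
    # 1-strongly convex and 2-smooth
    return x + np.tanh(x)

rng = np.random.default_rng(1)
h, n_steps, m = 0.01, 500, 1.0
x = np.full(5, 3.0)
y = np.full(5, -3.0)
d0 = np.linalg.norm(x - y)
for _ in range(n_steps):
    xi = np.sqrt(2.0 * h) * rng.standard_normal(5)  # shared noise
    x = x - h * grad_f(x) + xi
    y = y - h * grad_f(y) + xi

# shared noise cancels in x - y, so the contraction is deterministic
ratio = np.linalg.norm(x - y) / d0
print(ratio, np.exp(-m * h * n_steps))  # ratio is below exp(-m T)
```

Because the difference recursion (x − y) ← (x − y) − h(∇f(x) − ∇f(y)) involves no randomness, the observed contraction factor is bounded by (1 − hm) per step for this potential, regardless of the noise realization.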
Then we have a sequence of results that connect statistical properties with integration properties. We will see that a non-asymptotic sampling error analysis only requires bounding the orders of the local weak and strong integration errors (provided the continuous dynamics can be shown contractive).

Theorem 3.3. (Global Integration Error, Infinite Time Version) Suppose Eq. (1) is contractive with rate β and with respect to a non-singular matrix A ∈ R^{d×d}, with Lipschitz continuous b and σ, and there is a numerical algorithm A with step size h simulating the solution xt of the SDE, whose iterates are denoted by x̄k, k = 0, 1, .... Suppose there exist 0 < h0 ≤ 1, C1, C2 > 0, D1, D2 ≥ 0, p1 ≥ 1, 1/2 < p2 ≤ p1 − 1/2 such that for any 0 < h ≤ h0, the algorithm A has, respectively, local weak and strong errors of orders p1 and p2, in the sense that

‖E(xh − x̄1)‖ ≤ (C1 + D1 (E‖x‖²)^{1/2}) h^{p1},  (E‖xh − x̄1‖²)^{1/2} ≤ (C2² + D2² E‖x‖²)^{1/2} h^{p2},  (6)

where xh solves Eq. (1) with any initial value x and x̄1 is the result of applying A to x for one step. If the solution xt of the SDE and the algorithm A both start from x0, then for 0 < h ≤ h1 := min{h0, β/(2κ²_A(D1 + √C0 D2))}, the global error ek is bounded as

ek := (E‖x_{kh} − x̄k‖²)^{1/2} ≤ C h^{p2−1/2},  k = 0, 1, 2, ...,  where  (7)

C = (2/β) κ²_A (C1 + √C0 C2 + 2U(D1 + √C0 D2)/β) + C2,

C0 is from Eq. (5), κA is the condition number of the matrix A, and U² := 4E‖x0‖² + 6E_µ‖x‖².

Remark (what's new). Thm. 3.3 refines the seminal results in Li et al. (2019) in the sense that it only requires non-uniform bounds on the local errors (6), whereas Li et al. (2019) requires uniform bounds, i.e., D1 = D2 = 0 in (6). Therefore, the refinement we present has wider applicability. In general, local errors tend to depend on the current step's value, i.e., D1 ≠ 0, D2 ≠ 0. Allowing local error bounds to be non-uniform enabled applications such as proving the vanishing bias of the mirror Langevin algorithm (Li et al., 2021).
For a simpler illustration, consider LMC for a 1-D standard Gaussian target distribution; then we have E(xh − x̄1) = (e^{−h} − 1 + h) E x = (h²/2 + o(h²)) E x. One can see that the local error does depend on x and is not uniform. Meanwhile, our non-uniform condition still holds because |(h²/2 + o(h²)) E x| ≤ (h²/2 + o(h²)) (E‖x‖²)^{1/2} (and thus p1 = 2). Note that if the discretization does converge to a neighborhood of the target distribution, it is possible that E‖x‖² and/or ‖E x‖ become bounded near the convergence, and in this case the D1, D2 parts could be absorbed into C1 and C2; however, this "if" clause is exactly what we would like to prove. Nevertheless, we state for rigor that the convention 1/0 = ∞ is used when D1 = D2 = 0. Another remark is that, even in this case, our bound has a different expression from the seminal results. We will carefully work out, track, and combine the dimension dependence of constants using our bound.

Following Theorem 3.3, we obtain the following non-asymptotic bound on the sampling error in W2:

Theorem 3.4. (Non-Asymptotic Sampling Error Bound: General Case) Under the same assumptions and with the same notation as Theorem 3.3, we have

W2(Law(x̄k), µ) ≤ e^{−βkh} W2(Law(x0), µ) + C h^{p2−1/2},  0 < h ≤ h1.

A corollary of Theorem 3.4 is a bound on the mixing time of the sampling algorithm:

Corollary 3.5. (Upper Bound of Mixing Time: General Case) Under the same assumptions and with the same notation as Theorem 3.3, we have

τmix(ϵ; W2; A) ≤ max{ 1/(βh1), (1/β)(2C/ϵ)^{1/(p2−1/2)} } log(2W2(Law(x0), µ)/ϵ).

In particular, when high accuracy is needed, i.e., ϵ < 2C h1^{p2−1/2}, we have

τmix(ϵ; W2; A) ≤ (1/β)(2C/ϵ)^{1/(p2−1/2)} log(2W2(Law(x0), µ)/ϵ).

Corollary 3.5 states how the mixing time depends on the order p2 of the local (strong) error of a numerical algorithm. The larger p2 is, the shorter the mixing time of the algorithm, in terms of the dependence on the accuracy tolerance ϵ.
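Concretely, the high-accuracy bound in Corollary 3.5 scales as (1/ϵ)^{1/(p2 − 1/2)} up to log factors; a two-line helper (ours, for illustration only) makes the comparison across schemes explicit:

```python
def eps_exponent(p2):
    """Exponent a such that the Corollary 3.5 mixing-time bound scales
    like (1/eps)^a, up to log factors, given local strong order p2."""
    return 1.0 / (p2 - 0.5)

# Euler-Maruyama / LMC has p2 = 3/2, giving tau ~ 1/eps;
# a 1.5-order scheme (p2 = 2) improves this to tau ~ (1/eps)^(2/3).
print(eps_exponent(1.5), eps_exponent(2.0))
```

This is the mechanism behind the ϵ-dependence comparisons between LMC and higher-order discretizations discussed below.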
It is important to note that for constant-stepsize discretizations that are deterministic on the filtration of the driving Brownian motion and use only its increments, there is a strong order barrier, namely p2 ≤ 1.5 (Clark & Cameron, 1980; Rümelin, 1982); however, methods involving multiple stochastic integrals (e.g., Kloeden & Platen (1992); Milstein & Tretyakov (2013); Rößler (2010)) can yield a larger p2, and randomization (e.g., Shen & Lee (2019)) can possibly break the barrier too. The constant C defined in Eq. (7) typically contains rich information about the underlying SDE, e.g., the dimension, the Lipschitz constants of the drift and noise diffusion, and the initial value x0 of the sampling algorithm. Through C, we can uncover the dependence of the mixing time bound on various parameters, such as the dimension d. This will be detailed for Langevin Monte Carlo in the next section.

It is worth clarifying that once Thm. 3.3 is proved, establishing Theorem 3.4 and Corollary 3.5 is relatively easy. In fact, analogous results have already been provided in Li et al. (2019), although they also required uniform local errors as consequences of their Thm. 1. Accordingly, we do not claim novelty in Theorem 3.4 and Corollary 3.5; they are presented for completeness. Our main refinement is Thm. 3.3 over Thm. 1 in Li et al. (2019), and the non-triviality lies in its proof.

4 NON-ASYMPTOTIC ANALYSIS OF LANGEVIN MONTE CARLO ALGORITHM

We now quantify how LMC samples from a Gibbs target distribution µ ∝ exp(−f(x)) that has a finite second moment, i.e., ∫_{R^d} ‖x‖² dµ < ∞. Assume without loss of generality that the origin is a local minimizer of f, i.e., ∇f(0) = 0; this is for notational convenience in the analysis and can be realized via a simple coordinate shift, and it is not needed in the practical implementation. In addition, we assume the following two conditions hold:

A 1. (Smoothness and Strong Convexity) Assume f ∈ C² and is L-smooth and m-strongly-convex, i.e.,
there exist 0 < m ≤ L such that m I_d ⪯ ∇²f(x) ⪯ L I_d for all x ∈ R^d. Denote the condition number of f by κ := L/m.

The smoothness and strong-convexity assumption is the standard assumption in the literature on analyzing the LMC algorithm (Dalalyan, 2017a;b; Cheng & Bartlett, 2018; Durmus et al., 2019; Durmus & Moulines, 2019).

A 2. (Linear Growth of the 3rd-order Derivative) Assume f ∈ C³ and the operator ∇(∇²f) grows at most linearly, i.e., there exists a constant G > 0 such that ‖∇³f(x)‖ ≤ G(1 + ‖x‖).

Remark. The linear growth (at infinity) condition on ∇³f is actually not as restrictive as it appears, and is in some sense even weaker than a classical condition for the existence of solutions to SDEs. For example, a standard condition for ensuring the existence and uniqueness of a global solution to an SDE is at most linear growth (at infinity) of the drift (Pavliotis, 2014, Theorem 3.1). If we consider monomial potentials, i.e., f(x) = x^p, p ∈ N+, then the linear growth condition on ∇³f is met when p ≤ 4, whereas the classical condition for the existence of solutions holds only when p ≤ 2.

Remark. Another additional assumption, namely the Hessian Lipschitz condition, is commonly used in the literature (e.g., Durmus & Moulines (2019); Ma et al. (2021)). It requires the existence of a constant L̃ such that ‖∇²f(y) − ∇²f(x)‖ ≤ L̃‖y − x‖. It can be shown that smoothness and Hessian Lipschitzness imply A2. Meanwhile, examples that satisfy A2 but are not Hessian Lipschitz exist, e.g., f(x) = x⁴, and thus A2 is not necessarily stronger than Hessian Lipschitzness.

Remark. As with L and m in A1, we implicitly assume the constant G introduced in A2 to be independent of dimension. Meanwhile, it is important to note that examples for which G depends on the dimension do exist, and this is also true for other regularity constants, including not only L and m but also the Hessian Lipschitz constant L̃. This part of the assumption is a strong one.

Remark.
A2, together with Itô's lemma, helps establish the order p1 = 2 of the local weak error for LMC (see Lemma D.2), which enables us to obtain the √d dependence.

To apply mean-square analysis to the LMC algorithm, we need to ensure the underlying Langevin dynamics is contractive, which we verify in Sections C and D in the appendix. In addition, we work out all required constants to determine the C in Eq. (7) explicitly in the appendix. With all these necessary ingredients, we now invoke Theorem 3.4 and obtain the following result:

Theorem 4.1. (Non-Asymptotic Error Bound: LMC) Suppose Assumptions 1 and 2 hold. The LMC iteration x̄k+1 = x̄k − h∇f(x̄k) + √(2h) ξk satisfies

W2(Law(x̄k), µ) ≤ e^{−mkh} W2(Law(x0), µ) + C_LMC h,  0 < h ≤ 1/(4κL),  k ∈ N,  (10)

where C_LMC = (10(L² + G)/m^{3/2}) (2d + m E‖x0‖² + 1)^{1/2} = O(√d).

Corollary 3.5 combined with the above result gives the following bound on the mixing time of LMC:

Theorem 4.2. (Upper Bound of Mixing Time: LMC) Suppose Assumptions 1 and 2 hold. If running LMC from x0, we then have

τmix(ϵ; W2; LMC) ≤ max{ 4κ², 2C_LMC/(mϵ) } log(2W2(Law(x0), µ)/ϵ),

where C_LMC is the same as in Theorem 4.1. When high accuracy is needed, i.e., ϵ ≤ C_LMC/(2mκ²), we have

τmix(ϵ; W2; LMC) ≤ (2C_LMC/(mϵ)) log(2W2(Law(x0), µ)/ϵ).

This Õ(√d/ϵ) mixing time bound in W2 distance improves upon the previous ones (Dalalyan, 2017a; Cheng & Bartlett, 2018; Durmus & Moulines, 2019; Durmus et al., 2019) in the dependence on d and/or ϵ. If further assuming G = O(L²), we then have C_LMC = O(κ²√(md)) and Thm. 4.2 shows the mixing time is Õ(κ²√(d/m)/ϵ), which also improves the κ dependence of some previous results (Dalalyan, 2017a; Cheng & Bartlett, 2018) in the m ≥ 1 regime. A brief summary is in Table 1.

Table 1: Comparison of mixing time results in 2-Wasserstein distance for LMC with L-smooth and m-strongly-convex potential. Constant step size is used and the accuracy tolerance ϵ is small enough.
(Dalalyan, 2017a, Theorem 1): mixing time Õ(κ²d/ϵ²); no additional assumption.
(Cheng & Bartlett, 2018, Theorem 1): mixing time Õ(κ²d/ϵ²); no additional assumption.
(Durmus et al., 2019, Corollary 10): mixing time Õ(κd/(mϵ²)); no additional assumption.
(Durmus & Moulines, 2019, Theorem 8): mixing time Õ(d/ϵ)¹; additional assumption ‖∇²f(x) − ∇²f(y)‖ ≤ L̃‖x − y‖.
This work (Theorem 4.2): mixing time Õ(κ²√(d/m)/ϵ); additional assumption A2 and G = O(L²)².

¹ The dependence on κ is not readily available from Theorem 8 in Durmus & Moulines (2019).
² The G = O(L²) assumption is only for the κ, m dependence. Removing it does not affect the d, ϵ dependence.

Remark (more comparison). The seminal work of Li et al. (2019) provided a mean-square analysis (their Thm. 1) and obtained an Õ(d/ϵ) mixing time bound for LMC (their Example 1) under smoothness, strong convexity and Hessian Lipschitz conditions, consistent with that in Durmus & Moulines (2019). By using our version (Thm. 3.3) and tracking down the dimension dependence of the constants, we are able to tighten it to Õ(√d/ϵ). Worth clarifying is that dimension dependence might not be the focus of Li et al. (2019); instead, it considered ϵ dependence and other discretizations, and showed for example that a 1.5-order SRK discretization has an improved mixing time bound of Õ(d/ϵ^{2/3}). The dimension dependence of this discretization, for example, can possibly be improved by our results too.

Optimality. In fact, the Õ(√d/ϵ) mixing time of LMC is the optimal scaling one can expect. This is in terms of the dependence on d and ϵ, over the class of all log-smooth and log-strongly-convex target measures. To illustrate this, consider the Gaussian target distribution on R^{2d} whose potential is

f(x) = (m/2) Σ_{i=1}^{d} x_i² + (L/2) Σ_{i=d+1}^{2d} x_i²,  with m = 1, L ≥ 4m.  (11)

We now establish a lower bound on the mixing time of the LMC algorithm for this target measure.

Theorem 4.3. (Lower Bound of Mixing Time) Suppose we run LMC for the target measure defined in Eq.
(11) from x0 = 1_{2d}; then for any choice of step size h > 0 within the stability limit, we have

τmix(ϵ; W2; LMC) = Ω̃(√d/ϵ).

Combining Theorems 4.2 and 4.3, we see that mean-square analysis provides a tight bound for LMC and Õ(√d/ϵ) is the optimal scaling of LMC for target measures satisfying Assumptions 1 and 2. Note that the above optimality result only partially, but not completely, closes the gap between the upper and lower bounds of LMC over the entire family of log-smooth and log-strongly-convex target measures, because of one limitation of our result: A(ssumption) 2 is, despite its close relation to the Hessian Lipschitz condition frequently used in the literature, still an extra condition. We tend to believe that A2 may not be essential, but rather an artifact of our proof technique. However, at this moment we cannot eliminate the possibility that the best scaling one can get out of Assumption 1 only (no A2) is worse than Õ(√d/ϵ). We would like to further investigate this in future work.

Discussion. Besides Li et al. (2019) (see the previous Remark), let us briefly discuss three more important sampling algorithms related to LMC. Two of them are Kinetic Langevin Monte Carlo (KLMC) and the Randomized Midpoint Algorithm (RMA), both of which are discretizations of kinetic Langevin dynamics. The other is the Metropolis-Adjusted Langevin Algorithm (MALA), which uses the one-step update of LMC as a proposal and then accepts/rejects it via Metropolis-Hastings.

An Õ(√d/ϵ) mixing time in 2-Wasserstein distance for KLMC has been established for log-smooth and log-strongly-convex target measures in the existing literature (Cheng et al., 2018b; Dalalyan & Riou-Durand, 2020), and that was a milestone. Due to its better dimension dependence over the previously best known results for LMC, KLMC is understood to be the analog of Nesterov's accelerated gradient method for sampling (Ma et al., 2021).
Our findings show that LMC is able to achieve the same mixing time, although under an additional growth-at-infinity condition. However, this does not say anything about whether/how KLMC accelerates LMC, as the existing KLMC bound may still not be optimal. We also note KLMC has better condition number dependence than our current LMC result, although the κ dependence in our bound may not be tight. RMA (Shen & Lee, 2019) is based on a brilliant randomized discretization of kinetic Langevin dynamics and was shown to have further improved dimension dependence (and other pleasant properties). From the perspective of this work, we think this is because RMA is able to break the strong order barrier thanks to the randomization, and more investigations based on mean-square analysis should be possible. For MALA, a recent breakthrough (Chewi et al., 2021) establishes an Õ(√d) mixing time in W2 distance with warm start, and the dimension dependence is shown to be optimal. We see that without the Metropolis adjustment, LMC (under additional assumptions such as A2) can also achieve the same dimension dependence as MALA. But unlike LMC, MALA only has logarithmic dependence on 1/ϵ. With warm start, is it possible (and how) to improve the 1/ϵ dependence for LMC from polynomial to logarithmic? This question is beyond the scope of this paper but worth further investigation.

5 NUMERICAL EXAMPLES

This section numerically verifies our theoretical findings for LMC in Section 4, with a particular focus on the dependence of the discretization error in Theorem 4.1 on the dimension d and the step size h. To this end, we consider two target measures specified by the following two potentials:

f1(x) = (1/2)‖x‖² + log(Σ_{i=1}^{d} e^{x_i})  and  f2(x) = (1/2)‖x‖² + (1/(2√d)) Σ_{i=1}^{d} cos(d^{1/4} x_i).  (12)

It is not hard to see f1 is 2-smooth and 1-strongly-convex, f2 is 3/2-smooth and 1/2-strongly-convex, and both satisfy Assumption 2. f2 is also used in (Chewi et al., 2021) to illustrate the optimal dimension dependence of MALA.
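Before turning to these non-Gaussian targets, note that for an exactly quadratic potential such as (11) the LMC stationary error is computable in closed form, which makes the Θ(√d · h) scaling of Theorem 4.1 visible without any simulation. A sketch of ours for the isotropic case f(x) = m‖x‖²/2:

```python
import math

def lmc_gaussian_bias(d, h, m=1.0):
    """Exact W2 distance between the stationary law of LMC and the
    target N(0, I/m) for the quadratic potential f(x) = m ||x||^2 / 2.
    One LMC step is x -> (1 - h m) x + sqrt(2h) xi, an AR(1) process
    whose stationary per-coordinate variance v solves
    v = (1 - h m)^2 v + 2h, i.e. v = 2h / (1 - (1 - h m)^2)."""
    v = 2 * h / (1 - (1 - h * m) ** 2)
    # W2 between centered isotropic Gaussians: sqrt(d) * |std difference|
    return math.sqrt(d) * abs(math.sqrt(v) - math.sqrt(1 / m))

# the discretization bias grows exactly like sqrt(d) at fixed h,
# and linearly in h as h -> 0 (slope sqrt(m)/4 per coordinate)
print(lmc_gaussian_bias(1, 0.01), lmc_gaussian_bias(100, 0.01))
```

This matches both the upper bound C_LMC · h = O(√d · h) of Theorem 4.1 and the Ω̃(√d/ϵ) lower bound of Theorem 4.3 in its d and h dependence.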
An explicit expression for the 2-Wasserstein distance between non-Gaussian distributions is typically not available; instead, we use the Euclidean norm of the mean error as a surrogate, because ‖E x̄k − Eµ x‖ ≤ W2(Law(x̄k), µ) due to Jensen's inequality. To obtain an accurate estimate of the ground truth, we run 10⁸ independent LMC realizations using a tiny step size (h = 0.001), each till a fixed, long enough time, and use the empirical average to approximate Eµ x.

To study the dimension dependence of the sampling error, we fix the step size h = 0.1, and for each d ∈ {1, 2, 5, 10, 20, 50, 100, 200, 500, 1000}, we simulate 10⁴ independent Markov chains using the LMC algorithm for 100 iterations, which is long enough for the chains to be well mixed. The mean and the standard deviation of the sampling error corresponding to the last 10 iterates are recorded.

To study the step size dependence of the sampling error, we fix d = 10 and experiment with step sizes h ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} × 10⁻¹. We run LMC till T = 20, i.e., T/h iterations for each h. The procedure is repeated 10⁴ times with different random seeds to obtain independent samples. When the corresponding continuous time t = kh > 10, we see from Eq. (10) that LMC is well converged and the sampling error is saturated by the discretization error. Therefore, for each h, we take the last 10/h iterates and record the mean and standard deviation of their sampling error.

[Figure 1 here, with panels (a) f1: d dependence, (b) f2: d dependence, (c) f1: h dependence, (d) f2: h dependence.]

Figure 1: Dependence of the sampling error of LMC on dimension d and step size h for f1 and f2. Both axes in Figs. 1a & 1b are in log scale. Shaded areas in Figs. 1a & 1b represent one standard deviation over the last 10 iterations. Shaded areas in Figs. 1c & 1d represent one standard deviation over the last 10/h iterations.

Results shown in Fig. 1 are consistent with our theoretical analysis of the sampling error.
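The mean-error surrogate protocol above can be sketched as follows (a reduced-scale sketch of ours with a hypothetical 1-D asymmetric potential; the paper's f1/f2 from Eq. (12) and full sample counts would be substituted in practice):

```python
import numpy as np

def run_lmc(h, n_steps, n_chains, rng):
    # hypothetical test potential f(x) = x^2/2 + log(1 + exp(2x)):
    # strongly convex, smooth, and asymmetric so that E_mu[x] != 0
    grad = lambda x: x + 2.0 / (1.0 + np.exp(-2.0 * x))
    x = np.zeros(n_chains)
    for _ in range(n_steps):
        x = x - h * grad(x) + np.sqrt(2.0 * h) * rng.standard_normal(n_chains)
    return x

rng = np.random.default_rng(2)
# fine-step, long run serves as a ground-truth proxy for E_mu[x]
ref_mean = run_lmc(0.005, 2000, 20_000, rng).mean()
# coarser-step runs of equal physical time T = 10; record the surrogate
errs = [abs(run_lmc(h, int(10 / h), 20_000, rng).mean() - ref_mean)
        for h in (0.05, 0.1, 0.2)]
print(errs)
```

With the paper's much larger sample sizes, the Monte Carlo noise floor drops enough for the linear-in-h trend of the surrogate error to emerge, which is why results at smaller h are harder to resolve.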
Both the √d dependence on d and the linear dependence on h are supported by the empirical evidence. Note that results with smaller h are less accurate, because one starts to see the error of the empirical approximation due to finite samples.

6 CONCLUSION

Via a refined mean-square analysis of the Langevin Monte Carlo algorithm, we obtain an improved and optimal Õ(√d/ϵ) bound on its mixing time, which was previously thought to be obtainable only with the addition of momentum. This was under the standard smoothness and strong-convexity assumptions, plus an additional linear growth condition on the third-order derivative of the potential function, similar to the Hessian Lipschitz condition already popularized in the literature. Here are some possible directions worth further investigation: (i) combine mean-square analysis with stochastic gradient analysis to study SDE-based stochastic gradient MCMC methods; (ii) is it still possible to obtain √d dependence without A2, i.e., only under log-smooth and log-strongly-convex conditions? (iii) applications of mean-square analysis to other SDEs and/or discretizations; (iv) motivated by Chewi et al. (2021), it would be interesting to know whether the dependence on 1/ϵ can be improved to logarithmic, for example if LMC is initialized at a warm start.

ACKNOWLEDGMENTS

The authors thank Andre Wibisono, Chenxu Pang, anonymous reviewers and the area chair for suggestions that significantly improved the quality of this paper. MT was partially supported by NSF DMS-1847802 and ECCS-1936776. This work was initiated when HZ was a professor and RL was a Ph.D. student at Georgia Institute of Technology.

REFERENCES

Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng.
TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283, 2016.

Christophe Andrieu, Nando De Freitas, Arnaud Doucet, and Michael I. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50(1):5–43, 2003.

Changyou Chen, Ruiyi Zhang, Wenlin Wang, Bai Li, and Liqun Chen. A unified particle-optimization framework for scalable Bayesian sampling. In The Conference on Uncertainty in Artificial Intelligence, 2018.

Xiang Cheng and Peter L. Bartlett. Convergence of Langevin MCMC in KL-divergence. PMLR 83, (83):186–211, 2018.

Xiang Cheng, Niladri S. Chatterji, Yasin Abbasi-Yadkori, Peter L. Bartlett, and Michael I. Jordan. Sharp convergence rates for Langevin dynamics in the nonconvex setting. arXiv preprint arXiv:1805.01648, 2018a.

Xiang Cheng, Niladri S. Chatterji, Peter L. Bartlett, and Michael I. Jordan. Underdamped Langevin MCMC: A non-asymptotic analysis. Proceedings of the 31st Conference On Learning Theory, PMLR, 2018b.

Sinho Chewi, Chen Lu, Kwangjun Ahn, Xiang Cheng, Thibaut Le Gouic, and Philippe Rigollet. Optimal dimension dependence of the Metropolis-adjusted Langevin algorithm. COLT, 2021.

Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in Neural Information Processing Systems, pp. 3036–3046, 2018.

John M. C. Clark and R. J. Cameron. The maximum rate of convergence of discrete approximations for stochastic differential equations. In Stochastic Differential Systems Filtering and Control, pp. 162–171. Springer, 1980.

Arnak S. Dalalyan. Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. Conference on Learning Theory, pp. 678–689, 2017a.

Arnak S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017b.

Arnak S. Dalalyan and Lionel Riou-Durand. On sampling from a log-concave density using kinetic Langevin diffusions. Bernoulli, 26(3):1956–1988, 2020.

Masoumeh Dashti and Andrew M. Stuart. The Bayesian Approach to Inverse Problems, pp. 311–428. Springer International Publishing, Cham, 2017. ISBN 978-3-319-12385-1. doi: 10.1007/978-3-319-12385-1_7.

Alain Durmus and Eric Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. Annals of Applied Probability, 27(3):1551–1587, 2017.

Alain Durmus and Eric Moulines. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli, 25(4A):2854–2882, 2019.

Alain Durmus, Szymon Majewski, and Blazej Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. Journal of Machine Learning Research, 20(73), 2019.

Murat A. Erdogdu and Rasa Hosseinzadeh. On the convergence of Langevin Monte Carlo: The interplay between tail growth and smoothness. COLT, 2021.

Charlie Frogner and Tomaso Poggio. Approximate inference with Wasserstein gradient flows. In International Conference on Artificial Intelligence and Statistics, 2020.

Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker–Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998.

Peter E. Kloeden and Eckhard Platen. Numerical Solution of Stochastic Differential Equations. Springer, 1992.

Joseph Lehec. The Langevin Monte Carlo algorithm in the non-smooth log-concave case. arXiv preprint arXiv:2101.10695, 2021.

Ruilin Li, Hongyuan Zha, and Molei Tao. Hessian-free high-resolution Nesterov acceleration for sampling. arXiv preprint arXiv:2006.09230, 2020.

Ruilin Li, Molei Tao, Santosh S. Vempala, and Andre Wibisono. The mirror Langevin algorithm converges with vanishing bias. arXiv preprint arXiv:2109.12077, 2021.
Xuechen Li, Denny Wu, Lester Mackey, and Murat A. Erdogdu. Stochastic Runge–Kutta accelerates Langevin Monte Carlo and beyond. NeurIPS, 2019.

Jun S. Liu. Monte Carlo Strategies in Scientific Computing. Springer Science & Business Media, 2008.

Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pp. 2378–2386, 2016.

Yi-An Ma, Niladri S. Chatterji, Xiang Cheng, Nicolas Flammarion, Peter L. Bartlett, and Michael I. Jordan. Is there an analog of Nesterov acceleration for gradient-based MCMC? Bernoulli, 27(3):1942–1992, 2021. doi: 10.3150/20-BEJ1297.

Stephan Mandt, Matthew D. Hoffman, and David M. Blei. Stochastic gradient descent as approximate Bayesian inference. Journal of Machine Learning Research, 18(134):1–35, 2017.

Grigori Noah Milstein and Michael V. Tretyakov. Stochastic Numerics for Mathematical Physics. Springer Science & Business Media, 2013.

Wenlong Mou, Nicolas Flammarion, Martin J. Wainwright, and Peter L. Bartlett. Improved bounds for discretization of Langevin diffusions: Near-optimal rates without convexity. arXiv preprint arXiv:1907.11331, 2019.

Wenlong Mou, Yi-An Ma, Martin J. Wainwright, Peter L. Bartlett, and Michael I. Jordan. High-order Langevin diffusion yields an accelerated MCMC algorithm. Journal of Machine Learning Research, 22(42):1–41, 2021.

Grigorios A. Pavliotis. Stochastic Processes and Applications: Diffusion Processes, the Fokker–Planck and Langevin Equations, volume 60. Springer, 2014.

Christian Robert and George Casella. Monte Carlo Statistical Methods. Springer Science & Business Media, 2013.

Gareth O. Roberts and Jeffrey S. Rosenthal. Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1):255–268, 1998.

Gareth O. Roberts and Richard L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations.
Bernoulli, 2(4):341–363, 1996.

Andreas Rößler. Runge–Kutta methods for the strong approximation of solutions of stochastic differential equations. SIAM Journal on Numerical Analysis, 48(3):922–952, 2010.

W. Rümelin. Numerical treatment of stochastic differential equations. SIAM Journal on Numerical Analysis, 19(3):604–613, 1982.

Ruoqi Shen and Yin Tat Lee. The randomized midpoint method for log-concave sampling. In Advances in Neural Information Processing Systems, pp. 2098–2109, 2019.

Santosh Vempala and Andre Wibisono. Rapid convergence of the unadjusted Langevin algorithm: Isoperimetry suffices. In Advances in Neural Information Processing Systems, pp. 8092–8104, 2019.

Andre Wibisono. Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. In Conference On Learning Theory, pp. 2093–3027, 2018.

Ruiyi Zhang, Changyou Chen, Chunyuan Li, and Lawrence Carin. Policy optimization as Wasserstein gradient flows. In International Conference on Machine Learning, pp. 5737–5746, 2018.

A PROOF OF RESULTS IN SECTION 3

A.1 PROOF OF THEOREM 3.3 (GLOBAL INTEGRATION ERROR, INFINITE TIME VERSION)

Proof. We write the solution of an SDE as x_{t₀,x_{t₀}}(t₀ + t) when the dependence on the initialization needs to be highlighted, and denote t_k = kh and x_{t_k} = x_k for better readability. We first make an easy observation: contraction and a bounded second moment of the invariant distribution imply a bounded second moment of the SDE solution for all time. Let y₀ be a random variable following the invariant distribution of Eq. (1), i.e., y₀ ∼ µ; then y_t ∼ µ and

E‖x_t‖² ≤ 2E‖x_t − y_t‖² + 2E‖y_t‖²
≤ 2E‖x₀ − y₀‖² exp(−2βt) + 2E‖y_t‖²
≤ 4E(‖x₀‖² + ‖y₀‖²) exp(−2βt) + 2E‖y_t‖²
= 4E‖x₀‖² exp(−2βt) + (2 + 4 exp(−2βt)) E_{y∼µ}‖y‖²
≤ 4E‖x₀‖² + 6 ∫_{R^d} ‖y‖² dµ ≜ U²,

and it then follows that

E‖x̄_k‖² ≤ 2E‖x̄_k − x_k‖² + 2E‖x_k‖² ≤ 2e_k² + 2U².
(13)

Denote ⟨x, y⟩_A = ⟨Ax, Ay⟩, ‖x‖_A = ‖Ax‖, and f_k = {E‖x_k − x̄_k‖²_A}^{1/2}, where A is the non-singular matrix from Equation (4). Also denote the largest and smallest singular values of A by σ_max and σ_min, respectively, and the condition number of A by κ_A = σ_max/σ_min. Recall e_k = {E‖x_k − x̄_k‖²}^{1/2}; it is easy to see that

σ_min e_k ≤ f_k ≤ σ_max e_k. (15)

Further, we have the following decomposition:

f²_{k+1} = E‖x_{k+1} − x̄_{k+1}‖²_A
= E‖x_{t_k,x_{t_k}}(t_{k+1}) − x_{t_k,x̄_k}(t_{k+1}) + x_{t_k,x̄_k}(t_{k+1}) − x̄_{k+1}‖²_A
= E‖x_{t_k,x_{t_k}}(t_{k+1}) − x_{t_k,x̄_k}(t_{k+1})‖²_A   [term 1]
+ E‖x_{t_k,x̄_k}(t_{k+1}) − x̄_{k+1}‖²_A   [term 2]
+ 2E⟨A(x_{t_k,x_{t_k}}(t_{k+1}) − x_{t_k,x̄_k}(t_{k+1})), A(x_{t_k,x̄_k}(t_{k+1}) − x̄_{k+1})⟩.   [term 3] (16)

Term 1 is taken care of by the contraction property:

E‖x_{t_k,x_{t_k}}(t_{k+1}) − x_{t_k,x̄_k}(t_{k+1})‖²_A ≤ f_k² exp(−2βh). (17)

Term 2 is dealt with by the bound on the local strong error:

E‖x_{t_k,x̄_k}(t_{k+1}) − x̄_{k+1}‖²_A ≤ σ²_max (C₂² + D₂² E‖x̄_k‖²) h^{2p₂}. (18)

Term 3 requires more effort to cope with; by the decomposition in Eq. (5) we have

E⟨x_{t_k,x_{t_k}}(t_{k+1}) − x_{t_k,x̄_k}(t_{k+1}), x_{t_k,x̄_k}(t_{k+1}) − x̄_{k+1}⟩_A
= E⟨x_k − x̄_k, x_{t_k,x̄_k}(t_{k+1}) − x̄_{k+1}⟩_A + E⟨z_h(x_k, x̄_k), x_{t_k,x̄_k}(t_{k+1}) − x̄_{k+1}⟩_A
(i)= E⟨x_k − x̄_k, E[x_{t_k,x̄_k}(t_{k+1}) − x̄_{k+1} | F_k]⟩_A + E⟨z_h(x_k, x̄_k), x_{t_k,x̄_k}(t_{k+1}) − x̄_{k+1}⟩_A
(ii)≤ f_k {E‖E[x_{t_k,x̄_k}(t_{k+1}) − x̄_{k+1} | F_k]‖²_A}^{1/2} + {E‖z_h(x_k, x̄_k)‖²_A}^{1/2} {E‖x_{t_k,x̄_k}(t_{k+1}) − x̄_{k+1}‖²_A}^{1/2}
(iii)≤ σ_max f_k {E‖E[x_{t_k,x̄_k}(t_{k+1}) − x̄_{k+1} | F_k]‖²}^{1/2} + σ²_max {E‖z_h(x_k, x̄_k)‖²}^{1/2} {E‖x_{t_k,x̄_k}(t_{k+1}) − x̄_{k+1}‖²}^{1/2}
(iv)≤ σ_max f_k (C₁ + D₁{E‖x̄_k‖²}^{1/2}) h^{p₁} + κ_A σ_max C₀ f_k h^{1/2} (C₂ + D₂{E‖x̄_k‖²}^{1/2}) h^{p₂}
(v)≤ κ_A σ_max (C₁ + C₀C₂) f_k h^{p₂+1/2} + κ_A σ_max (D₁ + C₀D₂) {E‖x̄_k‖²}^{1/2} f_k h^{p₂+1/2}, (19)

where (i) uses the tower property of conditional expectation and F_k is the filtration at the k-th iteration, (ii) uses the Cauchy–Schwarz inequality, (iii) is due to the relationship between e_k and f_k, (iv) is due to the local weak error, the local strong error and Eq. (5), and (v) is due to p₁ ≥ p₂ + 1/2 and 0 < h ≤ h₀ ≤ 1. Now plug Eq. (17), (18) and (19) into Eq.
(16); we obtain

f²_{k+1} ≤ f_k² exp(−2βh) + σ²_max (C₂² + D₂² E‖x̄_k‖²) h^{2p₂} + κ_A σ_max (C₁ + C₀C₂) f_k h^{p₂+1/2} + κ_A σ_max (D₁ + C₀D₂) {E‖x̄_k‖²}^{1/2} f_k h^{p₂+1/2}
(i)≤ (1 − (3/2)βh) f_k² + σ²_max (C₂² + D₂² E‖x̄_k‖²) h^{2p₂} + κ_A σ_max (C₁ + C₀C₂) f_k h^{p₂+1/2} + κ_A σ_max (D₁ + C₀D₂) {E‖x̄_k‖²}^{1/2} f_k h^{p₂+1/2}
(ii)≤ (1 − (3/2)βh) f_k² + κ_A σ_max (C₁ + C₀C₂ + 2U(D₁ + C₀D₂)) f_k h^{p₂+1/2} + 2κ²_A D₂² f_k² h^{2p₂} + 2κ²_A (D₁ + C₀D₂) f_k² h^{p₂+1/2} + σ²_max (C₂² + 2D₂²U²) h^{2p₂}
(iii)≤ (1 − βh) f_k² + κ_A σ_max (C₁ + C₀C₂ + 2U(D₁ + C₀D₂)) f_k h^{p₂+1/2} + σ²_max (C₂² + 2D₂²U²) h^{2p₂}
(iv)≤ (1 − βh) f_k² + (β/4) f_k² h + (κ²_A σ²_max/β)(C₁ + C₀C₂ + 2U(D₁ + C₀D₂))² h^{2p₂} + σ²_max (C₂² + 2D₂²U²) h^{2p₂}
= (1 − (3/4)βh) f_k² + σ²_max ( κ²_A (C₁ + C₀C₂ + 2U(D₁ + C₀D₂))²/β + C₂² + 2D₂²U² ) h^{2p₂},

where (i) is due to the assumption 0 < h ≤ 1/(4β) and e^{−x} ≤ 1 − x + x²/2 for 0 < x < 1, (ii) is due to the upper bound on E‖x̄_k‖² in Eq. (13) together with σ_min e_k ≤ f_k, (iii) holds provided h is small enough that 2κ²_A D₂² h^{2p₂−1} + 2κ²_A (D₁ + C₀D₂) h^{p₂−1/2} ≤ β/2, and (iv) is due to the Cauchy–Schwarz inequality. Unfolding the above inequality (with f₀ = 0) gives us

f_k² ≤ (4σ²_max/β) ( κ²_A (C₁ + C₀C₂ + 2U(D₁ + C₀D₂))²/β + C₂² + 2D₂²U² ) h^{2p₂−1}.

Taking square roots on both sides and using √(a² + b² + c²) ≤ a + b + c for a, b, c ≥ 0 yields

f_k ≤ (2σ_max/√β) ( κ_A (C₁ + C₀C₂ + 2U(D₁ + C₀D₂))/√β + C₂ + √2 D₂U ) h^{p₂−1/2}.

Finally, using the relationship between e_k and f_k, we obtain

e_k ≤ (2κ_A/√β) ( κ_A (C₁ + C₀C₂ + 2U(D₁ + C₀D₂))/√β + C₂ + √2 D₂U ) h^{p₂−1/2}.

A.2 PROOF OF THEOREM 3.4 (NON-ASYMPTOTIC SAMPLING ERROR BOUND: GENERAL CASE)

Proof. Let y₀ ∼ µ, with (x₀, y₀) coupled such that E‖x₀ − y₀‖² = W₂²(Law(x₀), µ). Denote the solutions of Eq. (1) starting from x₀ and y₀ by x_t and y_t respectively, and let t_k = kh. We have

W₂(Law(x̄_k), µ) ≤ W₂(Law(x̄_k), Law(x_{t_k})) + W₂(Law(x_{t_k}), µ)
(i)≤ {E‖x̄_k − x_{t_k}‖²}^{1/2} + {E‖x₀ − y₀‖² exp(−2βt_k)}^{1/2}
= e_k + exp(−βt_k) W₂(Law(x₀), µ),

where (i) is due to the contraction assumption on Eq. (1). Invoking the conclusion of Theorem 3.3 completes the proof.

A.3 PROOF OF COROLLARY 3.5 (UPPER BOUND OF MIXING TIME: GENERAL CASE)

Proof.
Given any tolerance ϵ > 0, we know from Theorem 3.4 that if k is large enough and h is small enough such that

exp(−βkh) W₂(Law(x₀), µ) ≤ ϵ/2, (20)
C h^{p₂−1/2} ≤ ϵ/2, (21)

then W₂(Law(x̄_k), µ) ≤ ϵ. Solving Inequality (20) yields

k ≥ (1/(βh)) log( 2W₂(Law(x₀), µ)/ϵ ). (22)

To minimize this lower bound, we want to pick the step size h as large as possible. Besides h ≤ h₁, Eq. (21) poses a further constraint on h; hence we have

h ≤ min{ h₁, (ϵ/(2C))^{2/(2p₂−1)} }.

Plugging this upper bound of h into Eq. (22), we have

k ≥ (1/β) max{ 1/h₁, (2C/ϵ)^{2/(2p₂−1)} } log( 2W₂(Law(x₀), µ)/ϵ ).

When high accuracy is needed, i.e., ϵ < 2C h₁^{p₂−1/2}, we have

k ≥ (1/β) (2C/ϵ)^{2/(2p₂−1)} log( 2W₂(Law(x₀), µ)/ϵ ).

B PROOF OF RESULTS IN SECTION 4

B.1 PROOF OF THEOREM 4.1 (NON-ASYMPTOTIC ERROR BOUND: LMC)

Proof. From Lemma C.1 we know that Langevin dynamics is a member of the family of contractive SDEs, with contraction rate equal to the strong-convexity coefficient, β = m (w.r.t. the identity matrix I_{d×d}). Next, we work out the constants C₀, C₁, D₁, D₂, C₂ needed in Theorem 3.3. We have C₀ = √m/2, implied by Lemma C.3. The local strong error and local weak error are bounded in Lemmas D.1 and D.2 respectively. Note that the coefficients C̃₁ and C̃₂ in the bounds for the local weak/strong error depend on the initial value, which changes from iteration to iteration. Combined with Lemma D.3, we obtain C₁ and C₂, namely

C̃₁ = 2(L² + G)( √(d/(4κL) + E‖x̄_k‖²) + 1 ) ≤ 2(L² + G)( √(d/(4κL) + E‖x̄₀‖² + 8d/(7m)) + 1 ) ≤ 2(L² + G)( √(2d/m + E‖x̄₀‖²) + 1 ) ≜ C₁,
C̃₂ = 2L√(d + (m/4)E‖x̄_k‖²) ≤ 2L√(d + (m/4)(E‖x̄₀‖² + 8d/(7m))) ≤ 2L√(2d + (m/4)E‖x̄₀‖²) ≜ C₂.

We collect all the constants here in the proof for easier reference:

A = I_{d×d}, κ_A = 1, β = m, h₀ = 1/(4κL), C₀ = √m/2,
C₁ = 2(L² + G)( √(2d/m + E‖x̄₀‖²) + 1 ), D₁ = 0,
C₂ = 2L√(2d + (m/4)E‖x̄₀‖²), D₂ = 0.

Then the constant in Theorem 3.3 for the LMC algorithm simplifies to

(2/√m)( (C₁ + C₀C₂)/√m + C₂ ) ≤ ( 4(L² + G)/m + 6L )( √(2d/m + E‖x̄₀‖²) + 1 ) ≜ C_LMC.

Assuming L, m, G are all constants independent of d, clearly C_LMC = O(√d). Then, applying Theorem 3.4 to LMC (whose local orders are p₁ = 2 and p₂ = 3/2, so that h^{p₂−1/2} = h), we have

W₂(Law(x̄_k), µ) ≤ e^{−mkh} W₂(Law(x₀), µ) + C_LMC h (23)

for 0 < h ≤ 1/(4κL).

B.2 PROOF OF THEOREM 4.3 (LOWER BOUND OF MIXING TIME)

Proof.
If we start from x̄₀ = 1_{2d} and run LMC for the potential function in Eq. (11), we then have

(x̄_k)_i = (1 − mh)^k (x̄₀)_i + √(2h) ∑_{l=1}^k (1 − mh)^{k−l} (ξ_l)_i, for 1 ≤ i ≤ d,
(x̄_k)_i = (1 − Lh)^k (x̄₀)_i + √(2h) ∑_{l=1}^k (1 − Lh)^{k−l} (ξ_l)_i, for d + 1 ≤ i ≤ 2d,

so that

(x̄_k)_i ∼ N( (1 − mh)^k, (2/(m(2 − mh)))(1 − (1 − mh)^{2k}) ), for 1 ≤ i ≤ d,
(x̄_k)_i ∼ N( (1 − Lh)^k, (2/(L(2 − Lh)))(1 − (1 − Lh)^{2k}) ), for d + 1 ≤ i ≤ 2d.

Clearly, stability requires h < 2/L. The squared 2-Wasserstein distance between the law of the k-th iterate of LMC and the target distribution is

W₂²(Law(x̄_k), µ) = d(1 − mh)^{2k} + (d/m)( √( (2/(2 − mh))(1 − (1 − mh)^{2k}) ) − 1 )²
+ d(1 − Lh)^{2k} + (d/L)( √( (2/(2 − Lh))(1 − (1 − Lh)^{2k}) ) − 1 )².

Suppose W₂(Law(x̄_k), µ) ≤ ϵ; we then must have

d(1 − mh)^{2k} ≤ ϵ², (24)
(d/m)( √( (2/(2 − mh))(1 − (1 − mh)^{2k}) ) − 1 )² ≤ ϵ². (25)

A necessary condition for Eq. (25) is that

(2/(2 − mh))(1 − ϵ²/d) ≤(i) (2/(2 − mh))(1 − (1 − mh)^{2k}) ≤ (1 + ϵ√(m/d))², (26)

where (i) is due to Eq. (24). It follows from Eq. (26) and m = 1 that

h = O(ϵ/√d). (27)

Revisiting Eq. (24) yields

ϵ² ≥ d(1 − mh)^{2k} ≥(i) d(1 − 2mh + (2mh)²/2)^{2k} ≥(ii) d e^{−4mkh}, hence k ≥ (1/(2mh)) log(√d/ϵ), (28)

where (i) is due to mh < 1/2 and (ii) is due to e^{−x} ≤ 1 − x + x²/2 for 0 < x < 1. Combining Eq. (27) and (28), we then obtain a lower bound on the mixing time of order (√d/ϵ) log(√d/ϵ).

C SOME PROPERTIES OF LANGEVIN DYNAMICS

C.1 CONTRACTION OF LANGEVIN DYNAMICS

Lemma C.1. Suppose Assumption 1 holds. Then two copies of the overdamped Langevin dynamics have the following contraction property:

{E‖y_t − x_t‖²}^{1/2} ≤ {E‖y − x‖²}^{1/2} exp(−mt),

where x, y are the initial values of x_t, y_t.

Proof. First assume x, y are deterministic. Suppose x_t, y_t are respectively the solutions to

dx_t = −∇f(x_t)dt + √2 dB_t,
dy_t = −∇f(y_t)dt + √2 dB_t,

where B_t is a standard d-dimensional Brownian motion. Denote L_t = (1/2)E‖y_t − x_t‖² and take the time derivative; we obtain

(d/dt) L_t = −E⟨y_t − x_t, ∇f(y_t) − ∇f(x_t)⟩ ≤(i) −m E‖y_t − x_t‖² = −2m L_t,

where (i) is due to the strong-convexity assumption made on f. By Grönwall's inequality, we then obtain L_t ≤ L₀ exp(−2mt), and it follows that

{E‖y_t − x_t‖²}^{1/2} ≤ ‖y − x‖ exp(−mt).

When x, y are random, by the conditional version of the above inequality and Jensen's inequality, we have

{E‖y_t − x_t‖²}^{1/2} = {E E[‖y_t − x_t‖² | x, y]}^{1/2} ≤ {E ‖y − x‖² exp(−2mt)}^{1/2} = {E‖y − x‖²}^{1/2} exp(−mt).

C.2 GROWTH BOUND OF LANGEVIN DYNAMICS

Lemma C.2.
Suppose Assumption 1 holds. Then, for 0 ≤ h ≤ 1/(4κL), the solution x_t of the overdamped Langevin dynamics satisfies

E‖x_h − x‖² ≤ 6( d + (m/4)E‖x‖² ) h,

where x is the initial value at t = 0.

Proof. We have

E‖x_h − x‖² = E‖ −∫₀ʰ ∇f(x_t)dt + √2 B_h ‖²
≤ 2E( L∫₀ʰ ‖x_t − x‖dt + h‖∇f(x)‖ )² + 2E‖√2 B_h‖²
(i)(ii)≤ 4hd + 4h²E‖∇f(x)‖² + 4L²h ∫₀ʰ E‖x_t − x‖² dt,

where (i) is due to Itô's isometry (E‖√2 B_h‖² = 2hd) and (ii) is due to the Cauchy–Schwarz inequality. By Grönwall's inequality, we obtain

E‖x_h − x‖² ≤ 4h( d + hE‖∇f(x)‖² ) exp(4L²h²).

Since ‖∇f(x)‖ = ‖∇f(x) − ∇f(0)‖ ≤ L‖x‖, when 0 < h ≤ 1/(4κL) we finally arrive at

E‖x_h − x‖² ≤ 4e^{1/4}( d + hL²E‖x‖² ) h ≤ 6( d + (m/4)E‖x‖² ) h.

C.3 BOUND ON EVOLVED DEVIATION

Lemma C.3. Suppose Assumption 1 holds. Let x_t, y_t be two solutions of the overdamped Langevin dynamics starting from x and y respectively. For 0 < h ≤ 1/(4κL), we have the following representation:

x_h − y_h = x − y + z, with E‖z‖² ≤ (m/4)E‖x − y‖² h.

Proof. Let z = (x_h − y_h) − (x − y) = −∫₀ʰ (∇f(x_s) − ∇f(y_s))ds. Itô's lemma readily implies that

E‖x_h − y_h‖² = E‖x − y‖² − 2E∫₀ʰ ⟨x_s − y_s, ∇f(x_s) − ∇f(y_s)⟩ds ≤(i) E‖x − y‖² − 2m∫₀ʰ E‖x_s − y_s‖²ds,

where (i) is due to the strong convexity of f; in particular, E‖x_s − y_s‖² ≤ E‖x − y‖² for all 0 ≤ s ≤ h. We then have

E‖z‖² = E‖∫₀ʰ (∇f(x_s) − ∇f(y_s))ds‖²
≤ h∫₀ʰ E‖∇f(x_s) − ∇f(y_s)‖²ds
≤ L²h∫₀ʰ E‖x_s − y_s‖²ds
≤ L²h² E‖x − y‖²
(i)≤ (m/4)E‖x − y‖² h,

where (i) is due to h ≤ 1/(4κL).

D SOME PROPERTIES OF LMC ALGORITHM

D.1 LOCAL STRONG ERROR

Lemma D.1. Suppose Assumption 1 holds. Denote the one-step iterate of the LMC algorithm with step size h by x̄₁ and the solution of the overdamped Langevin dynamics at time t = h by x_h. Both the discrete algorithm and the continuous dynamics start from the same initial value x. If 0 ≤ h ≤ 1/(4κL), then the local strong error of the LMC algorithm satisfies

{E‖x̄₁ − x_h‖²}^{1/2} ≤ C̃₂ h^{3/2}, with C̃₂ = 2L√( d + (m/4)E‖x‖² ).

Proof. We have, for 0 ≤ h ≤ 1/(4κL),

E‖x̄₁ − x_h‖² = E‖∫₀ʰ (∇f(x_s) − ∇f(x))ds‖²
(i)≤ L²h ∫₀ʰ E‖x_s − x‖² ds
(ii)≤ 3L²( d + (m/4)E‖x‖² ) h³,

where (i) is due to the Cauchy–Schwarz inequality and (ii) is due to Lemma C.2. Taking square roots on both sides (and using √3 ≤ 2) completes the proof.
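The synchronous coupling used in the contraction argument of Lemma C.1 (two chains driven by the same Brownian motion) is easy to check numerically; the sketch below uses the discrete LMC recursion as a stand-in for the continuous dynamics.

```python
import numpy as np

def coupled_gap(grad_f, x0, y0, h, n_steps, rng):
    """Synchronous coupling: run two LMC chains driven by the SAME noise and
    track ||x_k - y_k||.  For an m-strongly-convex potential the gap contracts,
    a discrete-time analogue of the exp(-mt) contraction in Lemma C.1."""
    x, y = np.array(x0, dtype=float), np.array(y0, dtype=float)
    gaps = [float(np.linalg.norm(x - y))]
    for _ in range(n_steps):
        xi = rng.standard_normal(x.shape)        # shared noise
        x = x - h * grad_f(x) + np.sqrt(2.0 * h) * xi
        y = y - h * grad_f(y) + np.sqrt(2.0 * h) * xi
        gaps.append(float(np.linalg.norm(x - y)))
    return gaps
```

For the quadratic f(x) = ‖x‖²/2 (so m = 1) the shared noise cancels exactly and the gap shrinks by the deterministic factor (1 − h) per step.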
D.2 LOCAL WEAK ERROR

Lemma D.2. Suppose Assumptions 1 and 2 hold. Denote the one-step iterate of the LMC algorithm with step size h by x̄₁ and the solution of the overdamped Langevin dynamics at time t = h by x_h. Both the discrete algorithm and the continuous dynamics start from the same initial value x. If 0 ≤ h ≤ 1/(4κL), then the local weak error of the LMC algorithm satisfies

‖E x̄₁ − E x_h‖ ≤ C̃₁ h², with C̃₁ = 2(L² + G)( √( d/(4κL) + E‖x‖² ) + 1 ).

Proof. By Itô's lemma, we have

d∇f(x_t) = −∇²f(x_t)∇f(x_t)dt + Δ(∇f)(x_t)dt + √2 ∇²f(x_t)dB_t.

It follows that

‖E x̄₁ − E x_h‖ = ‖E∫₀ʰ (∇f(x_s) − ∇f(x))ds‖
= ‖E∫₀ʰ∫₀ˢ ( −∇²f(x_r)∇f(x_r) + Δ(∇f)(x_r) )dr ds‖   (the martingale term has zero mean)
≤ ∫₀ʰ∫₀ˢ E‖∇²f(x_r)∇f(x_r)‖ dr ds + ∫₀ʰ∫₀ˢ E‖Δ(∇f)(x_r)‖ dr ds
(i)≤ (L² + G) ∫₀ʰ∫₀ˢ ( E‖x_r‖ + 1 ) dr ds
(ii)≤ (L² + G) ∫₀ʰ∫₀ˢ ( {E‖x_r‖²}^{1/2} + 1 ) dr ds
(iii)≤ (L² + G)(h²/2) ( { 12( d + (m/4)E‖x‖² )h + 2E‖x‖² }^{1/2} + 1 )
(iv)(v)≤ 2(L² + G)( √( d/(4κL) + E‖x‖² ) + 1 ) h²,

where (i) is due to Assumption 2 and the L-smoothness of f, (ii) is due to Jensen's inequality, (iii) is due to E‖x_r‖² ≤ 2E‖x_r − x‖² + 2E‖x‖² and Lemma C.2, (iv) is due to √(a + b) ≤ √a + √b, and (v) is due to h ≤ 1/(4κL). It is worth noting that in the third line the Itô correction term Δ(∇f) can also be written as ∇(Δf), as the two operators commute for C³ functions.

D.3 BOUNDEDNESS OF LMC ALGORITHM

Lemma D.3. Suppose Assumption 1 holds and denote the iterates of LMC by x̄_k. If 0 ≤ h ≤ 1/(4κL), then the iterates of the LMC algorithm are uniformly bounded:

E‖x̄_k‖² ≤ E‖x̄₀‖² + 8d/(7m).

Proof.
We have

E‖x̄_{k+1}‖² = E‖x̄_k − h∇f(x̄_k) + √(2h) ξ_{k+1}‖²
(i)= E‖x̄_k‖² + h²E‖∇f(x̄_k)‖² + 2hd − 2h E⟨x̄_k, ∇f(x̄_k)⟩
= E‖x̄_k‖² + h²E‖∇f(x̄_k) − ∇f(0)‖² + 2hd − 2h E⟨x̄_k, ∇f(x̄_k)⟩
(ii)≤ E‖x̄_k‖² + h²L²E‖x̄_k‖² + 2hd − 2h E⟨x̄_k, ∇f(x̄_k)⟩
(iii)≤ E‖x̄_k‖² + h²L²E‖x̄_k‖² + 2hd − 2mh E‖x̄_k‖²
(iv)≤ (1 − (7/4)mh) E‖x̄_k‖² + 2hd,

where (i) is due to the independence between ξ_{k+1} and x̄_k, (ii) is due to Assumption 1, (iii) is due to the property of m-strongly-convex functions, ⟨∇f(y) − ∇f(x), y − x⟩ ≥ m‖y − x‖² for all x, y ∈ R^d, and (iv) uses the assumption h ≤ 1/(4κL), so that h²L² ≤ mh/4. Unfolding the inequality, we obtain

E‖x̄_k‖² ≤ (1 − (7/4)mh)^k E‖x̄₀‖² + 2hd · (1 − (1 − (7/4)mh)^k)/((7/4)mh) ≤ E‖x̄₀‖² + 8d/(7m).
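The uniform moment bound of Lemma D.3 can be probed numerically; the sketch below tracks E‖x̄_k‖² for LMC on the quadratic f(x) = m‖x‖²/2 and compares it with E‖x̄₀‖² + 8d/(7m) (the constant 8d/(7m) is as reconstructed above; treat it as an assumption).

```python
import numpy as np

def second_moments(d, m, h, n_steps, n_chains, seed=0):
    """Track E||x_k||^2 along LMC for f(x) = m*||x||^2/2 (so grad f(x) = m*x),
    averaging over many chains started at x_0 = 0; the Lemma D.3-style bound
    is then E||x_0||^2 + 8d/(7m) = 8d/(7m)."""
    rng = np.random.default_rng(seed)
    x = np.zeros((n_chains, d))                  # x_0 = 0
    out = []
    for _ in range(n_steps):
        x = x - h * m * x + np.sqrt(2.0 * h) * rng.standard_normal(x.shape)
        out.append(float(np.mean(np.sum(x * x, axis=1))))
    return out
```

For m = L = 1 (so κ = 1) and h = 0.1 ≤ 1/(4κL), the stationary value of E‖x̄_k‖² is d·2h/(1 − (1 − mh)²) ≈ 1.05·d, safely below the bound 8d/7.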