Stochastic Variance-Reduced Hamilton Monte Carlo Methods

Difan Zou*, Pan Xu*, Quanquan Gu
*Equal contribution. Department of Computer Science, University of California, Los Angeles, CA 90095, USA. Correspondence to: Quanquan Gu.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

We propose a fast stochastic Hamilton Monte Carlo (HMC) method for sampling from a smooth and strongly log-concave distribution. At the core of our proposed method is a variance reduction technique inspired by recent advances in stochastic optimization. We show that, to achieve $\epsilon$ accuracy in 2-Wasserstein distance, our algorithm requires $\tilde{O}\big(n + \kappa^2 d^{1/2}/\epsilon + \kappa^{4/3} d^{1/3} n^{2/3}/\epsilon^{2/3}\big)$ gradient complexity (i.e., number of component gradient evaluations), which outperforms the state-of-the-art HMC and stochastic gradient HMC methods in a wide regime. We also extend our algorithm to sampling from smooth and general log-concave distributions, and prove the corresponding gradient complexity. Experiments on both synthetic and real data demonstrate the superior performance of our algorithm.

1. Introduction

Past decades have witnessed increasing attention to Markov Chain Monte Carlo (MCMC) methods in modern machine learning problems (Andrieu et al., 2003). An important family of MCMC algorithms, the Langevin Monte Carlo methods (Neal et al., 2011), is based on Langevin dynamics (Parisi, 1981). Langevin dynamics was originally used to model the dynamics of molecular systems, and can be described by the following Itô stochastic differential equation (SDE) (Øksendal, 2003),

$$\mathrm{d}X_t = -\nabla f(X_t)\,\mathrm{d}t + \sqrt{2\beta^{-1}}\,\mathrm{d}B_t, \qquad (1.1)$$

where $X_t$ is a $d$-dimensional stochastic process, $t \ge 0$ denotes the time index, $\beta > 0$ is the inverse temperature parameter, and $B_t$ is the standard $d$-dimensional Brownian motion. Under certain assumptions on the drift coefficient $\nabla f$, Chiang et al. (1987) showed that the distribution of $X_t$ in (1.1) converges to its stationary distribution, a.k.a. the Gibbs measure $\pi_\beta \propto \exp(-\beta f(x))$. Note that $\pi_\beta$ is smooth and log-concave (resp. strongly log-concave) if $f$ is smooth and convex (resp. strongly convex). A typical way to sample from the density $\pi_\beta$ is to apply the Euler-Maruyama discretization scheme (Kloeden & Platen, 1992) to (1.1), which yields

$$x_{k+1} = x_k - \eta\nabla f(x_k) + \sqrt{2\eta\beta^{-1}}\,\epsilon_k, \qquad (1.2)$$

where $\epsilon_k \sim N(0, I_{d\times d})$ is a standard Gaussian random vector, $I_{d\times d}$ is a $d\times d$ identity matrix, and $\eta > 0$ is the step size. (1.2) is often referred to as the Langevin Monte Carlo (LMC) method. In total variation (TV) distance, LMC has been proved to produce approximate samples from the density $\pi_\beta \propto e^{-\beta f}$ at arbitrary precision, with a properly chosen step size (Dalalyan, 2014; Durmus & Moulines, 2016b). The non-asymptotic convergence of LMC has also been studied in Dalalyan (2017); Dalalyan & Karagulyan (2017); Durmus et al. (2017), which show that the LMC algorithm achieves $\epsilon$ precision in 2-Wasserstein distance after $\tilde{O}(\kappa^2 d/\epsilon^2)$ iterations if $f$ is $L$-smooth and $\mu$-strongly convex, where $\kappa = L/\mu$ is the condition number.
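For concreteness, the update (1.2) takes only a few lines of code. The following is a minimal sketch, assuming a generic gradient oracle `grad_f` and $\beta = 1$; the function name and defaults are illustrative choices of ours, not part of the original paper.

```python
import numpy as np

def lmc_sample(grad_f, x0, eta, n_steps, beta=1.0, rng=None):
    """Langevin Monte Carlo: Euler-Maruyama discretization (1.2) of (1.1)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        eps = rng.standard_normal(x.shape)
        # x_{k+1} = x_k - eta * grad f(x_k) + sqrt(2 * eta / beta) * eps_k
        x = x - eta * grad_f(x) + np.sqrt(2.0 * eta / beta) * eps
    return x
```

When $f$ is smooth and strongly convex, the iterates approach $\pi_\beta$ up to a discretization bias controlled by the step size $\eta$.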
In order to accelerate the convergence of Langevin dynamics (1.1) and improve its mixing time to the unique stationary distribution, Hamiltonian dynamics (Duane et al., 1987; Neal et al., 2011), also known as underdamped Langevin dynamics, was proposed; it is defined by the following system of SDEs

$$\mathrm{d}V_t = -\gamma V_t\,\mathrm{d}t - u\nabla f(X_t)\,\mathrm{d}t + \sqrt{2\gamma u}\,\mathrm{d}B_t, \qquad \mathrm{d}X_t = V_t\,\mathrm{d}t, \qquad (1.3)$$

where $\gamma > 0$ is the friction parameter, $u$ denotes the inverse mass, $X_t, V_t \in \mathbb{R}^d$ are the position and velocity of the continuous-time dynamics respectively, and $B_t$ is the Brownian motion. Let $W_t = (X_t^\top, V_t^\top)^\top$. Under mild assumptions on the drift coefficient $\nabla f(x)$, the distribution of $W_t$ converges to a unique invariant distribution $\pi_w \propto e^{-f(x) - \|v\|_2^2/(2u)}$ (Neal et al., 2011), whose marginal distribution on $X_t$, denoted by $\pi$, is proportional to $e^{-f(x)}$. Similar to the numerical approximation of the Langevin dynamics in (1.2), one can also apply the Euler-Maruyama discretization scheme to the Hamiltonian dynamics in (1.3), which gives rise to the Hamiltonian Monte Carlo (HMC) method

$$v_{k+1} = v_k - \gamma\eta v_k - \eta u\nabla f(x_k) + \sqrt{2\gamma u\eta}\,\epsilon_k, \qquad x_{k+1} = x_k + \eta v_k. \qquad (1.4)$$

(1.4) provides an alternative way to sample from the target distribution $\pi \propto e^{-f(x)}$. While HMC has been observed to outperform LMC in a number of empirical studies (Chen et al., 2014; 2015), there did not exist a non-asymptotic convergence analysis of the HMC method until the very recent work of Cheng et al. (2017), where the sampling method in (1.4) is also called the underdamped Langevin MCMC algorithm. In particular, Cheng et al. (2017) proposed a variant of HMC based on coupling techniques, and showed that it achieves $\epsilon$ sampling accuracy in 2-Wasserstein distance within $\tilde{O}(\kappa^2 d^{1/2}/\epsilon)$ iterations for smooth and strongly convex $f$. This improves upon the convergence rate of LMC by a factor of $\tilde{O}(d^{1/2}/\epsilon)$.

Both LMC and HMC are gradient-based Monte Carlo methods and are effective in sampling from smooth and strongly log-concave distributions. However, they can be slow if the evaluation of the gradient is computationally expensive, especially on large datasets. This motivates using stochastic gradients instead of full gradients in LMC and HMC, which gives rise to Stochastic Gradient Langevin Dynamics (SGLD) (Welling & Teh, 2011; Ahn et al., 2012; Durmus & Moulines, 2016b; Dalalyan, 2017) and the Stochastic Gradient Hamilton Monte Carlo (SG-HMC) method (Chen et al., 2014; Ma et al., 2015; Chen et al., 2015), respectively. For smooth and strongly log-concave distributions, Dalalyan & Karagulyan (2017); Dalalyan (2017) proved that the convergence rate of SGLD is $\tilde{O}(\kappa^2 d\sigma^2/\epsilon^2)$, where $\sigma^2$ denotes the upper bound on the variance of the stochastic gradient. Cheng et al. (2017) proposed a variant of SG-HMC and proved that it converges after $\tilde{O}(\kappa^2 d\sigma^2/\epsilon^2)$ iterations. It is worth noting that although using stochastic gradient evaluations reduces the per-iteration cost, it comes at the price that the convergence rates of SGLD and SG-HMC are slower than those of LMC and HMC. Thus, a natural question is: does there exist an algorithm that can leverage stochastic gradients, but also achieve a faster rate of convergence?

In this paper, we answer this question affirmatively for the case where the function $f$ can be written as the finite sum of $n$ smooth component functions,

$$f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x). \qquad (1.5)$$

It is worth noting that the finite-sum structure is prevalent in machine learning, as the log-likelihood function of a dataset (e.g., $f$) is the sum of the log-likelihoods over each data point (e.g., $f_i$) in the dataset.
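Before introducing the variance-reduced method, it helps to fix the baseline in code. Below is a minimal sketch of one step of the discretized HMC update (1.4); the function name, signature, and defaults are our own illustrative choices, not from the paper.

```python
import numpy as np

def hmc_step(x, v, grad_f, eta, gamma=2.0, u=1.0, rng=None):
    """One step of the Euler-Maruyama discretization (1.4) of dynamics (1.3)."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal(x.shape)
    # v_{k+1} = v_k - gamma*eta*v_k - eta*u*grad f(x_k) + sqrt(2*gamma*u*eta)*eps_k
    v_next = v - gamma * eta * v - eta * u * grad_f(x) + np.sqrt(2.0 * gamma * u * eta) * eps
    # x_{k+1} = x_k + eta * v_k
    x_next = x + eta * v
    return x_next, v_next
```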
We propose a stochastic variance-reduced HMC (SVR-HMC) method, which incorporates a variance reduction technique into stochastic HMC. Our algorithm is inspired by recent advances in stochastic optimization (Roux et al., 2012; Johnson & Zhang, 2013; Xiao & Zhang, 2014; Defazio et al., 2014; Allen-Zhu & Hazan, 2016; Reddi et al., 2016; Lei & Jordan, 2016; Lei et al., 2017), which use semi-stochastic gradients to accelerate the optimization of finite-sum functions and to improve the runtime complexity of full-gradient methods. We also note that the variance reduction technique has already been employed in recent work on SGLD (Dubey et al., 2016; Baker et al., 2017); nevertheless, it did not yield an improvement in the dependence on the accuracy $\epsilon$. In detail, the proposed SVR-HMC uses a multi-epoch scheme to reduce the variance of the stochastic gradient. At the beginning of each epoch, it computes the full gradient (or an estimate of the full gradient) based on the entire dataset. Within each epoch, it performs semi-stochastic gradient updates and outputs the last iterate as the warm start for the next epoch. Thorough experiments on both synthetic and real data demonstrate the advantage of our proposed algorithm.

Our Contributions. The major contributions of our work are highlighted as follows.

- We propose a new algorithm, SVR-HMC, that incorporates the variance reduction technique into HMC. Our algorithm does not require the variance of the stochastic gradient to be bounded.
- We prove that SVR-HMC has a better gradient complexity than the state-of-the-art LMC and HMC methods for sampling from smooth and strongly log-concave distributions, when the error is measured in 2-Wasserstein distance. In particular, to achieve $\epsilon$ sampling error in 2-Wasserstein distance, our algorithm only needs $\tilde{O}\big(n + \kappa^2 d^{1/2}/\epsilon + \kappa^{4/3} d^{1/3} n^{2/3}/\epsilon^{2/3}\big)$ component gradient evaluations. This improves upon the state-of-the-art result of Cheng et al. (2017), which is $\tilde{O}(n\kappa^2 d^{1/2}/\epsilon)$, in a large regime.
- We generalize the analysis of SVR-HMC to sampling from smooth and general log-concave distributions by adding a diminishing regularizer. We prove that the gradient complexity of SVR-HMC to achieve $\epsilon$ accuracy in 2-Wasserstein distance is $\tilde{O}(n + d^{11/2}/\epsilon^6 + d^{11/3} n^{2/3}/\epsilon^4)$. To the best of our knowledge, this is the first convergence result in 2-Wasserstein distance for LMC-type methods on general log-concave distributions.

Notation. We denote discrete updates by the lower-case symbol $x_k$ and continuous-time dynamics by the upper-case symbol $X_t$. We denote by $\|x\|_2$ the Euclidean norm of a vector $x \in \mathbb{R}^d$. For a random vector $X_t \in \mathbb{R}^d$ (or $x_k \in \mathbb{R}^d$), we denote its probability distribution by $P(X_t)$ (or $P(x_k)$). We denote by $E_u(X)$ the expectation of $X$ under the probability measure $u$. The squared 2-Wasserstein distance between probability measures $u$ and $v$ is

$$W_2^2(u, v) = \inf_{\gamma\in\Gamma(u,v)} \int_{\mathbb{R}^d\times\mathbb{R}^d} \|X_u - X_v\|_2^2\,\mathrm{d}\gamma(X_u, X_v),$$

where $\Gamma(u, v)$ is the set of all joint distributions with $u$ and $v$ as marginal distributions. We use $a_n = O(b_n)$ to denote $a_n \le C b_n$ for some constant $C > 0$ independent of $n$, and $a_n = \tilde{O}(b_n)$ to hide logarithmic factors. We write $a_n \lesssim b_n$ ($a_n \gtrsim b_n$) if $a_n$ is less than (larger than) $b_n$ up to a constant, and $a \wedge b$ to denote $\min\{a, b\}$.

2. Related Work

In this section, we briefly review the relevant work in the literature.
Langevin Monte Carlo (LMC) methods (a.k.a. unadjusted Langevin algorithms) and their Metropolis-adjusted versions have been studied in a number of papers (Roberts & Tweedie, 1996; Roberts & Rosenthal, 1998; Stramer & Tweedie, 1999a;b; Jarner & Hansen, 2000; Roberts & Stramer, 2002), where asymptotic exponential convergence was established. In the past few years, numerous studies have emerged proving non-asymptotic convergence of LMC methods. Dalalyan (2014) first provided theoretical guarantees for approximate sampling with LMC from strongly log-concave and smooth distributions, proving a rate of $O(d/\epsilon^2)$ for the LMC algorithm with a warm start in total variation (TV) distance. This result was later extended to the Wasserstein metric by Dalalyan & Karagulyan (2017); Durmus & Moulines (2016b), where the same convergence rate in 2-Wasserstein distance holds without the warm start assumption. Recently, Cheng & Bartlett (2017) also proved an $\tilde{O}(d/\epsilon)$ convergence rate of the LMC algorithm in KL-divergence. Stochastic-gradient-based LMC, also known as stochastic gradient Langevin dynamics (SGLD), was originally proposed for Bayesian posterior sampling (Welling & Teh, 2011; Ahn et al., 2012). Dalalyan (2017); Dalalyan & Karagulyan (2017) analyzed the convergence rate of SGLD with both unbiased and biased stochastic gradients. In particular, they proved that the gradient complexity of unbiased SGLD is $\tilde{O}(\kappa^2 d\sigma^2/\epsilon^2)$, and showed that SGLD may not converge to the target distribution if the stochastic gradient has non-negligible bias. The SGLD algorithm has also been applied to nonconvex optimization: Raginsky et al. (2017) analyzed its non-asymptotic convergence rate; Zhang et al. (2017) provided theoretical guarantees for SGLD in terms of the hitting time to first- and second-order stationary points; and Xu et al. (2017) provided an analysis framework for the global convergence of LMC, SGLD and their variance-reduced variants based on the ergodicity of the discrete-time algorithms.

In order to improve the convergence rates of LMC methods, the Hamiltonian Monte Carlo (HMC) method was proposed by Duane et al. (1987); Neal et al. (2011), which introduces a momentum term into the dynamics. To deal with large datasets, stochastic gradient HMC has been proposed for Bayesian learning (Chen et al., 2014; Ma et al., 2015). Chen et al. (2015) investigated generic stochastic gradient MCMC algorithms with high-order integrators and provided a comprehensive convergence analysis. For strongly log-concave and smooth distributions, a non-asymptotic convergence guarantee was proved by Cheng et al. (2017) for underdamped Langevin MCMC, which is a variant of the stochastic gradient HMC method.

Our proposed algorithm is motivated by the stochastic variance-reduced gradient (SVRG) algorithm, which was first proposed by Johnson & Zhang (2013) and later extended to different problem setups (Xiao & Zhang, 2014; Defazio et al., 2014; Reddi et al., 2016; Allen-Zhu & Hazan, 2016; Lei & Jordan, 2016; Lei et al., 2017). Inspired by this line of research, Dubey et al. (2016) applied the variance reduction technique to stochastic gradient Langevin dynamics, and proved a slightly tighter convergence bound than SGLD. Nevertheless, the dependence of the convergence rate on the sampling accuracy $\epsilon$ was not improved. Thus, it remained open whether variance reduction can indeed improve the convergence rate of MCMC methods.
Our work answers this question in the affirmative and provides rigorously faster rates of convergence for sampling from log-concave and smooth density functions. For ease of comparison, we summarize the gradient complexity (defined as the number of stochastic gradient evaluations needed to achieve $\epsilon$ sampling accuracy) in 2-Wasserstein distance for different gradient-based Monte Carlo methods in Table 1. Evidently, for sampling from smooth and strongly log-concave distributions, SVR-HMC outperforms all existing algorithms.

Table 1. Gradient complexity of gradient-based Monte Carlo algorithms in 2-Wasserstein distance for sampling from smooth and strongly log-concave distributions.

Methods                       Gradient Complexity
LMC (Dalalyan, 2017)          $\tilde{O}(n\kappa^2 d/\epsilon^2)$
HMC (Cheng et al., 2017)      $\tilde{O}(n\kappa^2 d^{1/2}/\epsilon)$
SGLD (Dalalyan, 2017)         $\tilde{O}(\kappa^2 d\sigma^2/\epsilon^2)$
SG-HMC (Cheng et al., 2017)   $\tilde{O}(\kappa^2 d\sigma^2/\epsilon^2)$
SVR-HMC (this paper)          $\tilde{O}(n + \kappa^2 d^{1/2}/\epsilon + \kappa^{4/3} d^{1/3} n^{2/3}/\epsilon^{2/3})$

3. The Proposed Algorithm

In this section, we propose a novel HMC algorithm that leverages variance-reduced stochastic gradients to sample from the target distribution $\pi = e^{-f(x)}/Z$, where $Z = \int e^{-f(x)}\,\mathrm{d}x$ is the partition function. Recall that the function $f(x)$ has the finite-sum structure in (1.5). When $n$ is large, the full gradient $\frac{1}{n}\sum_{i=1}^n \nabla f_i(x)$ in (1.4) is expensive to compute, so a stochastic gradient is often used to reduce the per-iteration cost. However, due to the non-diminishing variance of the stochastic gradient, the convergence rate of gradient-based MC methods using stochastic gradients is often no better than that of gradient MC methods using full gradients.

In order to overcome this drawback and achieve a faster rate of convergence, we propose the Stochastic Variance-Reduced Hamiltonian Monte Carlo algorithm (SVR-HMC), which combines the advantages of HMC and variance reduction. The algorithm is displayed in Algorithm 1 below, and proceeds in a multi-epoch fashion. At the beginning of each epoch, it computes the full gradient of $f$ at a snapshot $\tilde{x}_j$ of the iterate. Within each epoch, it then performs the following updates on the velocity and position variables,

$$v_{k+1} = v_k - \gamma\eta v_k - \eta u g_k + \xi_k^v, \qquad x_{k+1} = x_k + \eta v_k + \xi_k^x, \qquad (3.1)$$

where $\gamma, \eta, u > 0$ are tuning parameters, and $g_k$ is a semi-stochastic gradient, i.e., an unbiased estimator of $\nabla f(x_k)$, defined as

$$g_k = \nabla f_{i_k}(x_k) - \nabla f_{i_k}(\tilde{x}_j) + \nabla f(\tilde{x}_j), \qquad (3.2)$$

where $i_k$ is uniformly sampled from $\{1,\ldots,n\}$, and $\tilde{x}_j$ is a snapshot of $x_k$ that is only updated every $m$ iterations, with $k = jm + l$ for some $l = 0,\ldots,m-1$. Here $\xi_k^x$ and $\xi_k^v$ are Gaussian random vectors with zero mean and covariance matrices

$$E[\xi_k^v(\xi_k^v)^\top] = u\,(1 - e^{-2\gamma\eta})\,I_{d\times d},$$
$$E[\xi_k^x(\xi_k^x)^\top] = \frac{u}{\gamma^2}\,(2\gamma\eta + 4e^{-\gamma\eta} - e^{-2\gamma\eta} - 3)\,I_{d\times d}, \qquad (3.3)$$
$$E[\xi_k^x(\xi_k^v)^\top] = \frac{u}{\gamma}\,(1 - 2e^{-\gamma\eta} + e^{-2\gamma\eta})\,I_{d\times d},$$

where $I_{d\times d}$ is a $d\times d$ identity matrix. The idea of semi-stochastic gradients has been used successfully in stochastic optimization to reduce the variance of the stochastic gradient and obtain faster convergence rates (Johnson & Zhang, 2013; Xiao & Zhang, 2014; Reddi et al., 2016; Allen-Zhu & Hazan, 2016; Lei & Jordan, 2016; Lei et al., 2017). Apart from the semi-stochastic gradient, the position update in (3.1) also differs from the direct Euler-Maruyama discretization (1.4) of the Hamiltonian dynamics due to the additional Gaussian noise term $\xi_k^x$. This additional noise term is pivotal in our theoretical analysis for obtaining faster convergence rates than LMC methods; a similar idea has been used in Cheng et al. (2017) to prove the faster rate of convergence of HMC (underdamped Langevin MCMC) over LMC.

Algorithm 1 Stochastic Variance-Reduced Hamiltonian Monte Carlo (SVR-HMC)
1:  initialization: $\tilde{x}_0 = 0$, $v_0 = 0$
2:  for $j = 0, \ldots, \lceil K/m \rceil$ do
3:    $\tilde{g} = \nabla f(\tilde{x}_j)$
4:    for $l = 0, \ldots, m - 1$ do
5:      $k = jm + l$
6:      uniformly sample $i_k \in [n]$
7:      $g_k = \nabla f_{i_k}(x_k) - \nabla f_{i_k}(\tilde{x}_j) + \tilde{g}$
8:      $x_{k+1} = x_k + \eta v_k + \xi_k^x$
9:      $v_{k+1} = v_k - \gamma\eta v_k - \eta u g_k + \xi_k^v$
10:     if $l = m - 1$ then
11:       $\tilde{x}_{j+1} = x_{k+1}$
12:     end if
13:   end for
14: end for
15: output: $x_K$
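To make Algorithm 1 concrete, here is a compact sketch in Python. The gradient oracle `grad_fi(i, x)`, the random-number handling, and all defaults are our own illustrative choices; the code mirrors the updates (3.1)-(3.3), with no claim to the numerical care a production sampler would need.

```python
import numpy as np

def svr_hmc(grad_fi, n, x0, eta, K, m, u, gamma=2.0, rng=None):
    """Sketch of Algorithm 1 (SVR-HMC). grad_fi(i, x) returns grad f_i(x);
    the full gradient is the average (1/n) * sum_i grad f_i(x)."""
    rng = np.random.default_rng() if rng is None else rng
    d = len(x0)
    # Joint covariance of (xi^x_k, xi^v_k) per coordinate, from (3.3).
    a = np.exp(-gamma * eta)
    cov = np.array([
        [u / gamma**2 * (2 * gamma * eta + 4 * a - a**2 - 3),  # E[xi^x (xi^x)^T]
         u / gamma * (1 - 2 * a + a**2)],                      # E[xi^x (xi^v)^T]
        [u / gamma * (1 - 2 * a + a**2),
         u * (1 - a**2)],                                      # E[xi^v (xi^v)^T]
    ])
    chol = np.linalg.cholesky(cov)
    x, v = np.array(x0, dtype=float), np.zeros(d)
    x_snap = x.copy()
    g_snap = np.mean([grad_fi(i, x_snap) for i in range(n)], axis=0)
    for k in range(K):
        i_k = rng.integers(n)
        # Semi-stochastic gradient (3.2): unbiased, with diminishing variance.
        g_k = grad_fi(i_k, x) - grad_fi(i_k, x_snap) + g_snap
        xi_x, xi_v = chol @ rng.standard_normal((2, d))
        # Position and velocity updates (3.1), using the previous x_k and v_k.
        x, v = x + eta * v + xi_x, v - gamma * eta * v - eta * u * g_k + xi_v
        if (k + 1) % m == 0:  # new epoch: refresh snapshot and full gradient
            x_snap = x.copy()
            g_snap = np.mean([grad_fi(i, x_snap) for i in range(n)], axis=0)
    return x
```

With $u = 1/L$, $\gamma = 2$ and $m = n$, this matches the parameter choices used in Theorem 4.3 and Corollary 4.5 below.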
4. Main Theory

In this section, we analyze the convergence, in 2-Wasserstein distance, between the distribution of the iterates of Algorithm 1 and the target distribution $\pi \propto e^{-f}$. Following recent work (Durmus & Moulines, 2016a; Dalalyan & Karagulyan, 2017; Dalalyan, 2017; Cheng et al., 2017), we use the 2-Wasserstein distance to measure the convergence rate of Algorithm 1, since it directly controls the level of approximation of the first and second order moments (Dalalyan, 2017; Dalalyan & Karagulyan, 2017). It is arguably more suitable for characterizing the quality of approximate sampling algorithms than other distance metrics such as total variation distance. In addition, while Algorithm 1 updates both the position variable $x_k$ and the velocity variable $v_k$, only the convergence of the position variable $x_k$ is of central interest.

4.1. SVR-HMC for Sampling from Strongly Log-concave Distributions

We first present the convergence rate and gradient complexity of SVR-HMC when $f$ is smooth and strongly convex, i.e., the target distribution $\pi \propto e^{-f}$ is smooth and strongly log-concave. We start with the following formal assumptions on the negative log-density function.

Assumption 4.1 (Smoothness). There exists a constant $L > 0$ such that for any $x, y \in \mathbb{R}^d$ and any $i$, $\|\nabla f_i(x) - \nabla f_i(y)\|_2 \le L\|x - y\|_2$.

Under Assumption 4.1, it can be easily verified that the function $f(x)$ is also $L$-smooth.

Assumption 4.2 (Strong Convexity). There exists a constant $\mu > 0$ such that for any $x, y \in \mathbb{R}^d$,

$$f(x) \ge f(y) + \langle\nabla f(y), x - y\rangle + \frac{\mu}{2}\|x - y\|_2^2.$$

Note that the strong convexity assumption is only made on the finite-sum function $f$, not on the individual component functions $f_i$.

Theorem 4.3. Under Assumptions 4.1 and 4.2, let $P(x_K)$ denote the distribution of the last iterate $x_K$, and $\pi \propto e^{-f(x)}$ the stationary distribution of (1.3). Set $u = 1/L$, $\gamma = 2$, $x_0 = 0$, $v_0 = 0$ and $\eta = \tilde{O}\big(1/\kappa \wedge 1/(\kappa^{1/3} n^{2/3})\big)$. Assume $\|x^*\|_2 \le R$ for some constant $R > 0$, where $x^* = \arg\min_x f(x)$ is the global minimizer of $f$. Then the output of Algorithm 1 satisfies

$$W_2\big(P(x_K), \pi\big) \le e^{-\eta K/(2\kappa)}\, w_0 + 4\eta\kappa\,(2\eta D_1 + D_2)^{1/2} + (2\kappa D_3)^{1/2}\, m\,\eta^{3/2}, \qquad (4.2)$$

where $w_0 = W_2(P(x_0), \pi)$, $\kappa = L/\mu$ is the condition number, $\eta$ is the step size, and $m$ is the epoch (i.e., inner loop) length of Algorithm 1. Here $D_1$, $D_2$ and $D_3$ are problem-dependent quantities, with $D_2 = 13U_v + 8U_f/L$ and $D_3 = U_v + 4ud$, in which the parameters $U_v$ and $U_f$ are of order $O(d/\mu)$ and $O(d\kappa)$, respectively.

Remark 4.4. Existing analyses of stochastic Langevin Monte Carlo methods (Dalalyan & Karagulyan, 2017; Zhang et al., 2017) and stochastic Hamiltonian Monte Carlo methods (Chen et al., 2014; 2015; Cheng et al., 2017) require a bounded variance of the stochastic gradient, i.e., that the inequality $E_i[\|\nabla f_i(x) - \nabla f(x)\|_2^2] \le \sigma^2$ holds uniformly for all $x \in \mathbb{R}^d$. In contrast, our analysis does not need this assumption, which implies that our algorithm is applicable to a larger class of target density functions.
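To see where the step size in Corollary 4.5 below comes from, one can balance the three terms in (4.2). The following back-of-the-envelope sketch is ours; $C_2$ and $C_3$ are shorthand for the coefficients of the $\eta$ and $m\eta^{3/2}$ terms, with constants and logarithmic factors suppressed.

```latex
% Require each term of
%   e^{-\eta K/(2\kappa)} w_0 + C_2\,\eta + C_3\, m\,\eta^{3/2} \le \epsilon
% to be at most \epsilon/3, which gives
\[
  K \asymp \frac{2\kappa}{\eta}\log\frac{3w_0}{\epsilon},
  \qquad
  \eta \lesssim \frac{\epsilon}{C_2} \wedge \Big(\frac{\epsilon}{C_3\, m}\Big)^{2/3}.
\]
% With m = n, each epoch costs n component gradients for the snapshot plus 2m
% for the inner loop, i.e. O(1) gradients per iteration amortized, so
\[
  \text{gradient complexity}
  = \widetilde{O}\Big(n + \frac{\kappa}{\eta}\Big)
  = \widetilde{O}\Big(n + \frac{\kappa^2 d^{1/2}}{\epsilon}
    + \frac{\kappa^{4/3} d^{1/3} n^{2/3}}{\epsilon^{2/3}}\Big),
\]
% using C_2 = \widetilde{O}(\kappa d^{1/2}) and C_3 = \widetilde{O}((\kappa d)^{1/2}),
% as suggested by D_2, D_3 = O(d/\mu) in Theorem 4.3 with \mu treated as a constant.
```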
In the following corollary, by specifying the step size $\eta$ and the epoch length $m$, we present the gradient complexity of Algorithm 1 in 2-Wasserstein distance.

Corollary 4.5. Under the same conditions as in Theorem 4.3, let $m = n$ and $\eta = O\big(\epsilon/(\kappa d^{1/2}) \wedge \epsilon^{2/3}/(\kappa^{1/3} d^{1/3} n^{2/3})\big)$. Then the output of Algorithm 1 satisfies $W_2(P(x_K), \pi) \le \epsilon$ after

$$\tilde{O}\Big(n + \frac{\kappa^2 d^{1/2}}{\epsilon} + \frac{\kappa^{4/3} d^{1/3} n^{2/3}}{\epsilon^{2/3}}\Big) \qquad (4.3)$$

stochastic gradient evaluations.

Remark 4.6. Recall that the gradient complexity of HMC is $\tilde{O}(n\kappa^2 d^{1/2}/\epsilon)$ and that of SG-HMC is $\tilde{O}(\kappa^2 d\sigma^2/\epsilon^2)$, both recently proved in Cheng et al. (2017). Corollary 4.5 shows that the gradient complexity of our SVR-HMC algorithm has a better dependence on the dimension $d$. Note that the gradient complexity of SVR-HMC in (4.3) depends on the relationship between the sample size $n$ and the precision parameter $\epsilon$. To make a thorough comparison with existing algorithms, we discuss our result in the following three regimes:

- When $n \lesssim \kappa d^{1/4}/\epsilon^{1/2}$, the gradient complexity of our algorithm is dominated by $\tilde{O}(\kappa^2 d^{1/2}/\epsilon)$, which is lower than that of the HMC algorithm by a factor of $\tilde{O}(n)$ and lower than that of the SG-HMC algorithm by a factor of $\tilde{O}(d^{1/2}\sigma^2/\epsilon)$.
- When $\kappa d^{1/4}/\epsilon^{1/2} \lesssim n \lesssim \kappa d\sigma^3/\epsilon^2$, the gradient complexity of our algorithm is dominated by $\tilde{O}(\kappa^{4/3} d^{1/3} n^{2/3}/\epsilon^{2/3})$. This improves that of HMC by a factor of $\tilde{O}(n^{1/3}\kappa^{2/3} d^{1/6}/\epsilon^{1/3})$, and is lower than that of SG-HMC by a factor of $\tilde{O}(\kappa^{2/3} d^{2/3}\sigma^2 n^{-2/3}/\epsilon^{4/3})$. Plugging the upper bound on $n$ into (4.3) yields $\tilde{O}(\kappa^2 d\sigma^2/\epsilon^2)$ gradient complexity, which still matches that of SG-HMC.
- When $n \gtrsim \kappa^4 d/\epsilon^2$, i.e., the sample size is very large, the gradient complexity of our algorithm is dominated by $\tilde{O}(n)$. It is still lower than that of HMC by a factor of $\tilde{O}(\kappa^2 d^{1/2}/\epsilon)$. Nonetheless, our algorithm has a higher gradient complexity than SG-HMC due to the extremely large sample size, which suggests that SG-HMC (Cheng et al., 2017) is the most suitable algorithm in this regime.

Moreover, from Corollary 4.5 we know that the optimal step size for SVR-HMC is of order $O\big(\epsilon^{2/3}/(\kappa^{1/3} d^{1/3} n^{2/3})\big)$, while the optimal step size for SG-HMC is of order $O\big(\epsilon^2/(\sigma^2 d\kappa)\big)$ (Dalalyan, 2017), which is smaller than the step size of SVR-HMC when $n \lesssim \kappa d\sigma^3/\epsilon^2$. This observation aligns with the effect of variance reduction in the field of optimization.

4.2. SVR-HMC for Sampling from General Log-concave Distributions

In this section, we extend the analysis of the proposed SVR-HMC algorithm to sampling from distributions that are general log-concave but not strongly log-concave. In detail, we want to sample from the distribution $\pi \propto e^{-f(x)}$, where $f$ is general convex and $L$-smooth. We follow the idea of Dalalyan (2014) and construct a strongly log-concave distribution by adding a quadratic regularizer to the convex and $L$-smooth function $f$, which yields

$$\bar{f}(x) = f(x) + \frac{\lambda}{2}\|x\|_2^2,$$

where $\lambda > 0$ is a regularization parameter. Apparently, $\bar{f}$ is $\lambda$-strongly convex and $(L + \lambda)$-smooth. We can then apply Algorithm 1 to the function $\bar{f}$, which amounts to sampling from the modified target distribution $\bar{\pi} \propto e^{-\bar{f}}$.
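Algorithmically, this reduction only changes the gradient oracle. A minimal sketch (the wrapper below is ours, reusing the `grad_fi` convention from Section 3):

```python
def make_regularized_grad(grad_fi, lam):
    """Gradient oracle for the regularized components f_i(x) + (lam/2)||x||^2,
    whose average fbar(x) = f(x) + (lam/2)||x||^2 is lam-strongly convex
    and (L + lam)-smooth."""
    def grad_fi_bar(i, x):
        return grad_fi(i, x) + lam * x
    return grad_fi_bar
```

Running Algorithm 1 with `grad_fi_bar` targets $\bar{\pi} \propto e^{-\bar{f}}$; the parameter $\lambda$ trades the bias $W_2(\bar{\pi}, \pi)$ against the condition number $(L + \lambda)/\lambda$.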
We thereby obtain a sequence $\{x_k\}_{k=0,\ldots,K}$ whose distribution converges to the unique stationary distribution of the Hamiltonian dynamics (1.3) associated with $\bar{f}$, denoted by $\bar{\pi}$; according to Neal et al. (2011), $\bar{\pi}$ is proportional to $e^{-\bar{f}(x)}$. Denote the distribution of $x_k$ by $P(x_k)$. By the triangle inequality,

$$W_2\big(P(x_k), \pi\big) \le W_2\big(P(x_k), \bar{\pi}\big) + W_2(\bar{\pi}, \pi). \qquad (4.4)$$

To bound the 2-Wasserstein distance between $P(x_k)$ and the desired distribution $\pi$, it therefore suffices to additionally upper bound the 2-Wasserstein distance between the two Gibbs distributions $\bar{\pi}$ and $\pi$. Before presenting our theoretical characterization of this distance, we first lay down the following assumption.

Assumption 4.7. The distribution $\pi \propto e^{-f}$ has bounded fourth-order moment, i.e., there exists a constant $U$ such that $E_\pi[\|x\|_2^4] \le U$.

The following theorem spells out the convergence rate of SVR-HMC for sampling from a general log-concave distribution.

Theorem 4.8. Under Assumptions 4.1 and 4.7, in order to sample from a general log-concave density $\pi \propto e^{-f(x)}$, the output of Algorithm 1 applied to $\bar{f}(x) = f(x) + \lambda\|x\|_2^2/2$ satisfies $W_2(P(x_K), \pi) \le \epsilon$ after

$$\tilde{O}\Big(n + \frac{d^{11/2}}{\epsilon^6} + \frac{d^{11/3} n^{2/3}}{\epsilon^4}\Big)$$

gradient evaluations.

Regarding sampling from a smooth and general log-concave distribution, to the best of our knowledge there is no existing theoretical analysis of the convergence of LMC algorithms in 2-Wasserstein distance. Yet convergence analyses of LMC methods in total variation distance (Dalalyan, 2014; Durmus et al., 2017) and KL-divergence (Cheng & Bartlett, 2017) have recently been established. In detail, Dalalyan (2014) proved a convergence rate of $\tilde{O}(d^3/\epsilon^4)$ in total variation distance for LMC with general log-concave distributions, which implies $\tilde{O}(nd^3/\epsilon^4)$ gradient complexity. Durmus et al. (2017) improved the gradient complexity of LMC in total variation distance to $\tilde{O}(nd^5/\epsilon^2)$. Cheng & Bartlett (2017) proved the convergence of LMC in KL-divergence, which attains $\tilde{O}(nd/\epsilon^3)$ gradient complexity. It is worth noting that our convergence rate in 2-Wasserstein distance is not directly comparable to the aforementioned results.

5. Experiments

In this section, we compare the proposed algorithm (SVR-HMC) with state-of-the-art MCMC algorithms for Bayesian learning. To compare the convergence rates of the different MCMC algorithms, we conduct experiments on both synthetic and real data. We compare our algorithm with SGLD (Welling & Teh, 2011), VR-SGLD (Dubey et al., 2016), HMC (Cheng et al., 2017) and SG-HMC (Cheng et al., 2017).

5.1. Simulation Based on Synthetic Data

On the synthetic data, we construct each component function as $f_i(x) = (x - a_i)^\top \Sigma (x - a_i)/2$, where $a_i$ is a Gaussian random vector drawn from $N(2\cdot\mathbf{1}, 4I_{d\times d})$, and $\Sigma$ is a positive definite symmetric matrix with maximum eigenvalue $L = 3/2$ and minimum eigenvalue $\mu = 2/3$. Note that each random vector $a_i$ leads to a particular component function $f_i(x)$. It can then be observed that the target density $\pi \propto \exp\big(-(x - \bar{a})^\top \Sigma (x - \bar{a})/2\big)$ is a multivariate Gaussian distribution with mean $\bar{a} = \frac{1}{n}\sum_{i=1}^n a_i$ and covariance matrix $\Sigma^{-1}$. Moreover, the negative log-density $f(x)$ is $L$-smooth and $\mu$-strongly convex.
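Since the synthetic target is Gaussian, the ground truth is available in closed form, which is what makes the 2-Wasserstein distance estimable. Below is a sketch of the setup; the particular construction of $\Sigma$ (a random rotation of a spectrum spread over $[\mu, L]$) is our own choice, consistent with the stated eigenvalue bounds.

```python
import numpy as np

def make_synthetic(n, d, mu=2/3, L=3/2, rng=None):
    """Component functions f_i(x) = (x - a_i)^T Sigma (x - a_i) / 2 with the
    eigenvalues of Sigma in [mu, L]; the target is N(abar, Sigma^{-1})."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Random rotation of a spectrum spread over [mu, L].
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    sigma = q @ np.diag(np.linspace(mu, L, d)) @ q.T
    a = 2.0 + 2.0 * rng.standard_normal((n, d))   # a_i ~ N(2*1, 4I)
    abar = a.mean(axis=0)
    def grad_fi(i, x):
        return sigma @ (x - a[i])                 # grad f_i(x) = Sigma (x - a_i)
    return grad_fi, abar, np.linalg.inv(sigma)    # oracle, target mean, covariance
```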
In our simulation, we investigate different dimensions $d$ and numbers of component functions $n$, and show the 2-Wasserstein distance between the target distribution and the distribution of the output of each algorithm, as a function of the number of data passes. To estimate the 2-Wasserstein distance between the distribution of each iterate and the target, we repeat each algorithm 20,000 times and obtain 20,000 random samples per algorithm at each iteration.

[Figure 1. Convergence in 2-Wasserstein distance of HMC, SG-HMC and SVR-HMC on synthetic data versus the number of data passes. Panels: (a) d = 10, n = 50; (b) d = 10, n = 100; (c) d = 10, n = 1000; (d) d = 10, n = 5000; (e) d = 50, n = 50; (f) d = 50, n = 100; (g) d = 50, n = 1000; (h) d = 50, n = 5000.]

In Figure 1, we present the convergence results for the three HMC-based algorithms (HMC, SG-HMC and SVR-HMC). It is evident that SVR-HMC performs the best among the three when $n$ is not too large, and its performance becomes close to that of SG-HMC as the number of component functions increases. This phenomenon is well-aligned with our theoretical analysis, since the gradient complexity of our algorithm can be worse than that of SG-HMC when the sample size $n$ is extremely large.

5.2. Bayesian Logistic Regression for Classification

Table 2. Summary of datasets for Bayesian classification
Dataset        pima    a3a     gisette   mushroom
n (training)   384     3185    6000      4062
n (test)       384     29376   1000      4062
d              8       122     5000      112

We now apply our algorithm to Bayesian logistic regression. In logistic regression, given $n$ i.i.d. examples $\{a_i, y_i\}_{i=1,\ldots,n}$, where $a_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$ denote the features and binary labels respectively, the probability mass function of $y_i$ given the feature $a_i$ is modeled as

$$p(y_i \mid a_i, x) = \frac{1}{1 + e^{-y_i x^\top a_i}},$$

where $x \in \mathbb{R}^d$ is the regression parameter. Considering the prior $p(x) = N(0, \lambda^{-1}I)$, the posterior distribution takes the form

$$p(x \mid A, Y) \propto p(Y \mid A, x)\,p(x) = \prod_{i=1}^n p(y_i \mid a_i, x)\cdot p(x),$$

where $A = [a_1, a_2, \ldots, a_n]^\top$ and $Y = [y_1, y_2, \ldots, y_n]^\top$. The posterior distribution can be written as $p(x \mid A, Y) \propto \exp\big(-\frac{1}{n}\sum_{i=1}^n f_i(x)\big)$, where each $f_i(x)$ takes the form

$$f_i(x) = n\log\big(1 + \exp(-y_i x^\top a_i)\big) + \frac{\lambda}{2}\|x\|_2^2.$$
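Under the finite-sum convention $f = \frac{1}{n}\sum_i f_i$, the component gradients for this posterior can be sketched as follows (our own helper; note the factor $n$ in front of the per-example loss and the ridge term contributed by the Gaussian prior):

```python
import numpy as np

def make_logistic_grad(A, y, lam=1.0):
    """Component gradients for Bayesian logistic regression with labels in
    {-1, +1}: f_i(x) = n * log(1 + exp(-y_i <x, a_i>)) + (lam/2) ||x||^2,
    so that (1/n) sum_i f_i equals the negative log-posterior (up to a constant)."""
    n = A.shape[0]
    def grad_fi(i, x):
        z = y[i] * (A[i] @ x)
        # d/dx of n*log(1 + exp(-z)) = -n * y_i * a_i / (1 + exp(z))
        return -n * y[i] * A[i] / (1.0 + np.exp(z)) + lam * x
    return grad_fi
```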
We use four binary classification datasets from LIBSVM (Chang & Lin, 2011) and the UCI machine learning repository (Lichman, 2013), summarized in Table 2. Note that pima and mushroom do not have test data in their original versions, so we split each into 50% for training and 50% for testing. Following Welling & Teh (2011); Chen et al. (2014; 2015), we report the sample-path average and discard the first 50 iterations as burn-in; it is worth noting that we observe a similar convergence comparison between the algorithms for a larger burn-in period ($10^4$). We run each algorithm 20 times and report the averaged results. Note that the variance-reduction-based algorithms (i.e., VR-SGLD and SVR-HMC) require the first data pass to compute one full gradient; therefore, in Figure 2, the curves of VR-SGLD and SVR-HMC start from the second data pass while those of SGLD and SG-HMC start from the first. It can be clearly seen that our proposed algorithm converges faster than SGLD and SG-HMC on all datasets, which validates our theoretical analysis of the convergence rate. In addition, although there is no existing non-asymptotic theoretical guarantee for VR-SGLD when the target distribution is strongly log-concave, Figure 2 shows that SVR-HMC also outperforms VR-SGLD on these four datasets, which again demonstrates the superior performance of our algorithm. This clearly shows the advantage of our algorithm for Bayesian learning.

[Figure 2. Comparison of SGLD, SGHMC, VR-SGLD and SVR-HMC for Bayesian logistic regression; the y-axis shows the negative log-likelihood on the test data and the x-axis the number of data passes. Panels (a)-(d) correspond to the datasets pima, a3a, gisette and mushroom.]

Table 3. Test error of different algorithms for Bayesian classification after 10 entire data passes on 4 datasets
Method     pima              a3a               gisette           mushroom
SGLD       0.2314 ± 0.0044   0.1594 ± 0.0018   0.0098 ± 0.0009   (6.647 ± 2.251) × 10^-4
SGHMC      0.2306 ± 0.0079   0.1591 ± 0.0044   0.0096 ± 0.0006   (5.916 ± 2.734) × 10^-4
VR-SGLD    0.2299 ± 0.0056   0.1572 ± 0.0012   0.0105 ± 0.0006   (7.755 ± 3.231) × 10^-4
SVR-HMC    0.2289 ± 0.0043   0.1570 ± 0.0019   0.0093 ± 0.0011   (6.278 ± 3.149) × 10^-4

5.3. Bayesian Linear Regression

We also apply our algorithm to Bayesian linear regression and compare it with the baseline algorithms. Similar to Bayesian classification, given i.i.d. examples $\{a_i, y_i\}_{i=1,\ldots,n}$ with $y_i \in \mathbb{R}$, the likelihood of Bayesian linear regression is $p(y_i \mid a_i, x) = N(x^\top a_i, \sigma_a^2)$ and the prior is $p(x) = N(0, \lambda^{-1}I)$. We use four datasets, summarized in Table 4.

Table 4. Summary of datasets for Bayesian linear regression
Dataset   geographical   noise   parkinson   toms
n         1059           1503    5875        45730
d         69             5       21          96

In our experiments, we set $\sigma_a^2 = 1$ and $\lambda = 1$, normalize the original data, and split each dataset evenly into training and test sets. As before, we compute the sample-path average while treating the first 50 iterates as burn-in. Figure 3 reports the mean square errors on the test data for the four datasets. It is evident that our algorithm is faster than all the other baselines on all datasets, which further illustrates its advantage for Bayesian learning.

[Figure 3. Comparison of SGLD, SGHMC, VR-SGLD and SVR-HMC for Bayesian linear regression; the y-axis shows the mean square error on the test data and the x-axis the number of data passes. Panels (a)-(d) correspond to the datasets geographical, noise, parkinson and toms.]
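Mirroring the logistic case, the component gradients for the Gaussian likelihood and prior above can be sketched as follows (our own helper, under the same finite-sum convention):

```python
import numpy as np

def make_linreg_grad(A, y, sigma2=1.0, lam=1.0):
    """Component gradients for Bayesian linear regression with likelihood
    y_i ~ N(<x, a_i>, sigma2) and prior x ~ N(0, I / lam):
    f_i(x) = n * (y_i - <x, a_i>)^2 / (2 * sigma2) + (lam/2) ||x||^2."""
    n = A.shape[0]
    def grad_fi(i, x):
        # d/dx of the squared-error term plus the ridge term from the prior.
        return -n * (y[i] - A[i] @ x) * A[i] / sigma2 + lam * x
    return grad_fi
```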
6. Conclusions and Future Work

We propose a stochastic variance-reduced Hamilton Monte Carlo (HMC) method for sampling from a smooth and strongly log-concave distribution. We show that, to achieve $\epsilon$ accuracy in 2-Wasserstein distance, our algorithm enjoys a faster rate of convergence and better gradient complexity than state-of-the-art HMC and stochastic gradient HMC methods in a wide regime. We also extend our algorithm to sampling from smooth and general log-concave distributions. Experiments on both synthetic and real data verify the superior performance of our algorithm. In the future, we will extend our algorithm to non-log-concave distributions and study symplectic integration techniques such as leapfrog integration for Bayesian posterior sampling.

Acknowledgements

We would like to thank the anonymous reviewers for their helpful comments. This research was sponsored in part by the National Science Foundation IIS-1618948, IIS-1652539 and SaTC CNS-1717950. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.

References

Ahn, S., Balan, A. K., and Welling, M. Bayesian posterior sampling via stochastic gradient Fisher scoring. In ICML, 2012.

Allen-Zhu, Z. and Hazan, E. Variance reduction for faster non-convex optimization. In International Conference on Machine Learning, pp. 699-707, 2016.

Andrieu, C., De Freitas, N., Doucet, A., and Jordan, M. I. An introduction to MCMC for machine learning. Machine Learning, 50(1-2):5-43, 2003.

Baker, J., Fearnhead, P., Fox, E. B., and Nemeth, C. Control variates for stochastic gradient MCMC. arXiv preprint arXiv:1706.05439, 2017.

Bakry, D., Gentil, I., and Ledoux, M. Analysis and Geometry of Markov Diffusion Operators, volume 348. Springer Science & Business Media, 2013.

Chang, C.-C. and Lin, C.-J. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.

Chen, C., Ding, N., and Carin, L. On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In Advances in Neural Information Processing Systems, pp. 2278-2286, 2015.

Chen, T., Fox, E., and Guestrin, C. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pp. 1683-1691, 2014.

Cheng, X. and Bartlett, P. Convergence of Langevin MCMC in KL-divergence. arXiv preprint arXiv:1705.09048, 2017.

Cheng, X., Chatterji, N. S., Bartlett, P. L., and Jordan, M. I. Underdamped Langevin MCMC: A non-asymptotic analysis. arXiv preprint arXiv:1707.03663, 2017.

Chiang, T.-S., Hwang, C.-R., and Sheu, S. J. Diffusion for global optimization in R^n. SIAM Journal on Control and Optimization, 25(3):737-753, 1987.

Dalalyan, A. S. Theoretical guarantees for approximate sampling from smooth and log-concave densities. arXiv preprint arXiv:1412.7392, 2014.

Dalalyan, A. S. Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. arXiv preprint arXiv:1704.04752, 2017.

Dalalyan, A. S. and Karagulyan, A. G. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. arXiv preprint arXiv:1710.00095, 2017.

Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646-1654, 2014.

Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. Hybrid Monte Carlo. Physics Letters B, 195(2):216-222, 1987.

Dubey, K. A., Reddi, S. J., Williamson, S. A., Poczos, B., Smola, A. J., and Xing, E. P. Variance reduction in stochastic gradient Langevin dynamics. In Advances in Neural Information Processing Systems, pp. 1154-1162, 2016.

Durmus, A. and Moulines, E. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. 2016a.

Durmus, A. and Moulines, E. Sampling from strongly log-concave distributions with the unadjusted Langevin algorithm. arXiv preprint arXiv:1605.01559, 2016b.

Durmus, A., Moulines, E., et al. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability, 27(3):1551-1587, 2017.

Eberle, A., Guillin, A., and Zimmer, R. Couplings and quantitative contraction rates for Langevin dynamics. arXiv preprint arXiv:1703.01617, 2017.

Jarner, S. F. and Hansen, E.
Geometric ergodicity of Metropolis algorithms. Stochastic Processes and their Applications, 85(2):341-361, 2000.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315-323, 2013.

Kloeden, P. E. and Platen, E. Higher-order implicit strong numerical schemes for stochastic differential equations. Journal of Statistical Physics, 66(1):283-314, 1992.

Lei, L. and Jordan, M. I. Less than a single pass: Stochastically controlled stochastic gradient method. arXiv preprint arXiv:1609.03261, 2016.

Lei, L., Ju, C., Chen, J., and Jordan, M. I. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems, pp. 2345-2355, 2017.

Lichman, M. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

Ma, Y.-A., Chen, T., and Fox, E. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pp. 2917-2925, 2015.

Neal, R. M. et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11), 2011.

Øksendal, B. Stochastic differential equations. In Stochastic Differential Equations, pp. 65-84. Springer, 2003.

Parisi, G. Correlation functions and computer simulations. Nuclear Physics B, 180(3):378-384, 1981.

Raginsky, M., Rakhlin, A., and Telgarsky, M. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. arXiv preprint arXiv:1702.03849, 2017.

Reddi, S. J., Hefny, A., Sra, S., Poczos, B., and Smola, A. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pp. 314-323, 2016.

Roberts, G. O. and Rosenthal, J. S. Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1):255-268, 1998.

Roberts, G. O. and Stramer, O. Langevin diffusions and Metropolis-Hastings algorithms. Methodology and Computing in Applied Probability, 4(4):337-357, 2002.

Roberts, G. O. and Tweedie, R. L. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pp. 341-363, 1996.

Roux, N. L., Schmidt, M., and Bach, F. R. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pp. 2663-2671, 2012.

Stramer, O. and Tweedie, R. Langevin-type models I: Diffusions with given stationary distributions and their discretizations. Methodology and Computing in Applied Probability, 1(3):283-306, 1999a.

Stramer, O. and Tweedie, R. Langevin-type models II: Self-targeting candidates for MCMC algorithms. Methodology and Computing in Applied Probability, 1(3):307-328, 1999b.

Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681-688, 2011.

Xiao, L. and Zhang, T. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057-2075, 2014.

Xu, P., Chen, J., Zou, D., and Gu, Q. Global convergence of Langevin dynamics based algorithms for nonconvex optimization. arXiv preprint arXiv:1707.06618, 2017.

Zhang, Y., Liang, P., and Charikar, M. A hitting time analysis of stochastic gradient Langevin dynamics. arXiv preprint arXiv:1702.05575, 2017.