The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Aggregated Gradient Langevin Dynamics

Chao Zhang,1,2 Jiahao Xie,1 Zebang Shen,3 Peilin Zhao,2 Tengfei Zhou,1 Hui Qian1
1College of Computer Science and Technology, Zhejiang University
2Tencent AI Lab, 3University of Pennsylvania
{zczju, xiejh, zhoutengfei, qianhui}@zju.edu.cn, zebang@seas.upenn.edu, masonzhao@tencent.com
(This work was done while Chao Zhang was an intern at Tencent AI Lab.)

In this paper, we explore a general Aggregated Gradient Langevin Dynamics (AGLD) framework for Markov Chain Monte Carlo (MCMC) sampling. We investigate the nonasymptotic convergence of AGLD with a unified analysis for different data-accessing (e.g., random access, cyclic access, and random reshuffle) and snapshot-updating strategies, under convex and nonconvex settings respectively. It is the first time that bounds for I/O-friendly strategies such as cyclic access and random reshuffle have been established in the MCMC literature. The theoretical results also indicate that methods in AGLD possess the merits of both low per-iteration computational complexity and short mixture time. Empirical studies demonstrate that our framework allows us to derive novel schemes to generate high-quality samples for large-scale Bayesian posterior learning tasks.

1 Introduction

We focus on Langevin dynamics based Markov Chain Monte Carlo (MCMC) methods for sampling the parameter vector $\theta \in \mathbb{R}^d$ from a target posterior distribution

$$p^* \triangleq p(\theta \mid \{z_i\}_{i=1}^N) \propto p(\theta) \prod_{i=1}^N p(z_i \mid \theta), \tag{1}$$

where $p(\theta)$ is some prior of $\theta$, the $z_i$'s are the observed data points, and $p(z_i \mid \theta)$ is the likelihood function. The Langevin dynamics Monte Carlo method (LMC) adopts the gradient of the log-posterior in an iterative manner to drive the distribution of samples to the target distribution efficiently (Roberts and Stramer 2002; Roberts, Tweedie, and others 1996; Mattingly, Stuart, and Higham 2002). To reduce the computational complexity of large-scale posterior learning tasks, the Stochastic Gradient Langevin Dynamics method (SGLD), which replaces the expensive full gradient with a stochastic gradient, has been proposed (Welling and Teh 2011). While such a scheme enjoys a significantly reduced per-iteration cost, the mixture time, i.e., the total number of iterations required to achieve the correction from an out-of-equilibrium configuration to the target posterior distribution, is increased, due to the extra variance introduced by the approximation (Dalalyan and Karagulyan 2017; Dalalyan 2017b). In recent years, efforts have been made to design variance-control strategies that circumvent this slow convergence issue in SGLD. In particular, borrowing ideas from variance reduction methods in the optimization literature (Johnson and Zhang 2013; Defazio, Bach, and Lacoste-Julien 2014; Lei and Jordan 2017), the variance-reduced SGLD variants exploit the high correlations between consecutive iterates to construct unbiased aggregated gradient approximations with less variance, which leads to better mixture-time guarantees (Dubey et al. 2016; Zou, Xu, and Gu 2018b). Among these methods, SAGA-LD and SVRG-LD (Dubey et al. 2016) are proved to be the most effective ones when high-quality samples are required (Chatterji et al. 2018; Zou, Xu, and Gu 2019).
While the nonasymptotic convergence guarantees for SVRG-LD and SAGA-LD have been established, it is difficult to seamlessly extend these analyses to cover other Langevin dynamics based MCMC methods with different efficient gradient approximations. First, delicate and distinct Lyapunov functions are designed for SVRG-LD and SAGA-LD to prove nonasymptotic convergence to the stationary distribution; due to the different targets of optimization and MCMC, the mixture-time analysis is not a simple translation of the convergence-rate analysis in optimization. The lack of a unified perspective on these variance-reduced SGLD algorithms makes it difficult to effectively explore other variance-reduced estimators used in optimization (e.g., HSAG (Reddi et al. 2015)) for Langevin dynamics based MCMC sampling. In particular, customized Lyapunov functions need to be designed whenever a new variance-reduced estimator is adopted. Second, existing theoretical analyses rely heavily on the randomness of the data-accessing strategy to construct an unbiased estimator of the true gradient. In practice, the random access strategy entails heavy I/O cost, i.e., a large amount of data swapping between memory and disk when the dataset is too large to fit into memory, thereby rendering existing incremental Langevin dynamics based MCMC algorithms impractical for sampling tasks in the big-data scenario. While other data-accessing strategies such as cyclic access and random reshuffle are known to be disk I/O friendly (Xie et al. 2018), existing analyses cannot be directly extended to algorithms with these strategies.

Contributions

Motivated by such imperatives, we propose a general framework named Aggregated Gradient Langevin Dynamics (AGLD), which maintains a historical snapshot set of gradients to construct more accurate gradient approximations than SGLD. AGLD possesses a three-step structure: Data-Accessing, Sample-Searching, and Snapshot-Updating. Different Data-Accessing (e.g., random access, cyclic access, and random reshuffle) and Snapshot-Updating strategies can be utilized in this framework. By appropriately implementing these two steps, we can obtain several practical gradient approximations, including those used in existing methods like SVRG-LD and SAGA-LD. Under mild assumptions, a unified mixture-time analysis of AGLD is established, which holds as long as each component of the snapshot set is updated at least once in a fixed duration. We list our main contributions as follows.

• We first analyze the mixture time of AGLD under the assumptions that the negative log-posterior $f(x)$ is smooth and strongly convex, and then extend the analysis to the general convex case. We also provide theoretical analysis for nonconvex $f(x)$. These results indicate that AGLD has mixture-time bounds similar to those of LMC under similar assumptions, while its per-iteration computation is much lower than that of LMC. Moreover, the analysis provides a unified bound for a wide class of algorithms, with no need to design dedicated Lyapunov functions for different Data-Accessing and Snapshot-Updating strategies.

• For the first time, mixture-time guarantees for the cyclic access and random reshuffle Data-Accessing strategies are provided in the Langevin dynamics based MCMC literature. This fills the gap between practical use and theoretical analysis, since cyclic access is I/O friendly and often used as a practical substitute for random access when the dataset is too large to fit into memory.
• We develop a novel Snapshot-Updating strategy, named Time-based Mixture Updating (TMU), which enjoys the advantages of both the Snapshot-Updating strategies used in SVRG-LD and SAGA-LD: it always updates components in the snapshot set to newly computed ones as in SAGA-LD, and also periodically updates the whole snapshot set to rule out out-of-date ones as in SVRG-LD. Plugging TMU into AGLD, we derive novel algorithms to generate high-quality samples for Bayesian learning tasks.

• Simulated and real-world experiments are conducted to validate our analysis. Empirical results demonstrate the advantages of the proposed variants over the state-of-the-art.

2 Preliminaries

2.1 Wasserstein Distance and Mixture Time

We use the 2-Wasserstein ($W_2$) distance to evaluate the effectiveness of our methods. Specifically, the $W_2$ distance between two probability measures $\rho$ and $\nu$ is defined as

$$W_2^2(\rho, \nu) = \inf_{\pi \in \Gamma(\rho,\nu)} \int \|x - y\|_2^2 \, d\pi(x, y),$$

where $(x, y)$ are random variables with joint distribution $\pi$, and $\Gamma(\rho, \nu)$ denotes the collection of joint distributions with $\rho$ and $\nu$ as marginals. In this paper, we say $K$ is the $\epsilon$-mixture time of a Monte Carlo sampling procedure if for every $k \ge K$, the distribution $p^{(k)}$ of the sample generated in the $k$-th iteration satisfies $W_2(p^{(k)}, p^*) \le \epsilon$.

2.2 Stochastic Langevin Dynamics

By discretizing certain dynamics, dynamics based MCMC methods allow us to efficiently sample from the target distribution. A large portion of such works are based on the Langevin dynamics (Parisi 1981)

$$d\theta(t) = -\nabla f(\theta(t))\,dt + \sqrt{2}\,dB(t), \tag{2}$$

where $f$ is called the drift term, $B(t)$ is a $d$-dimensional Brownian motion, and $\theta(t) \in \mathbb{R}^d$ is the state variable. The classic Langevin dynamics Monte Carlo method (LMC) generates samples $\{x^{(k)}\}$ in the following manner:

$$x^{(k+1)} = x^{(k)} - \eta \nabla f(x^{(k)}) + \sqrt{2\eta}\,\xi^{(k)}, \tag{3}$$

where $x^{(k)}$ is the time discretization of the continuous-time dynamics $\theta(t)$, $\eta$ is the stepsize, and $\xi^{(k)} \sim N(0, I_{d \times d})$ is a $d$-dimensional Gaussian variable. The distribution $p^{(k)}$ of the sample $x^{(k)}$ is shown to converge weakly to the target distribution $p^*$ (Dalalyan 2017a; Raginsky, Rakhlin, and Telgarsky 2017). To alleviate the expensive full-gradient computation in LMC, the Stochastic Gradient Langevin Dynamics method (SGLD) replaces $\nabla f(x^{(k)})$ in (3) with the stochastic approximation

$$g^{(k)} = \frac{N}{n} \sum_{i \in I_k} \nabla f_i(x^{(k)}), \tag{4}$$

where $I_k$ is the set of $n$ indices drawn i.i.d. from $[N]$ in iteration $k$, and each $f_i(\theta) = -\log p(z_i \mid \theta) - \log p(\theta)/N$ for $i \in [N]$. Although $g^{(k)}$ is always an unbiased estimator of the full gradient, its non-diminishing variance results in inefficient sample-space exploration and slows down the convergence to the target distribution. To overcome this difficulty, SVRG-LD and SAGA-LD (Dubey et al. 2016; Chatterji et al. 2018; Zou, Xu, and Gu 2019) use two different variance-reduced gradient estimators of $\nabla f(x)$, which utilize the component gradient information of past samples. While possessing a low per-iteration component gradient cost similar to SGLD's, the mixture-time bounds of SVRG-LD and SAGA-LD are shown to be similar to that of LMC (Chatterji et al. 2018).

Algorithm 1 Aggregated Gradient Langevin Dynamics
Require: initial iterate $x^{(0)}$, stepsize $\eta$, Data-Accessing strategy, and Snapshot-Updating strategy.
1: Initialize the snapshot set $A^{(0)} = \{\alpha_i^{(0)}\}_{i=1}^N$, where $\alpha_i^{(0)} = \nabla f_i(x^{(0)})$.
2: for $k = 0$ to $K - 1$ do
3:   $S_k$ = Data-Accessing($k$).
4:   Sample-Searching: find $x^{(k+1)}$ according to (5).
5:   $A^{(k+1)}$ = Snapshot-Updating($A^{(k)}, x^{(k)}, k, S_k$).
6: end for

3 Aggregated Gradient Langevin Dynamics

In this section, we present our general framework, named Aggregated Gradient Langevin Dynamics (AGLD). Specifically, AGLD maintains a snapshot set consisting of component gradients evaluated at historical iterates. The information in the snapshot set is used in each iteration to construct a gradient approximation, which helps generate the next iterate. Note that the iterates generated during the procedure are samples of random variables whose distributions converge to the target distribution. At the end of each iteration, the entries in the snapshot set are updated according to some strategy. By customizing the steps in AGLD with different strategies, we can derive different algorithms. Concretely, AGLD is comprised of the following three steps, where the first and third steps can accept different strategies as inputs.

(i) Data-Accessing: select a subset of indices $S_k$ from $[N]$ according to the input strategy.

(ii) Sample-Searching: construct the aggregated gradient approximation $g^{(k)}$ using the data points indexed by $S_k$ and the historical snapshot set, then generate the next iterate (the new sample) by taking one step along the direction of $-g^{(k)}$ with injected Gaussian noise. Specifically, the $(k+1)$-th sample is obtained as

$$x^{(k+1)} = x^{(k)} - \eta g^{(k)} + \sqrt{2\eta}\,\xi^{(k)}, \tag{5}$$

where $\xi^{(k)}$ is a Gaussian noise, $\eta$ is the stepsize, and

$$g^{(k)} = \frac{N}{n} \sum_{i \in S_k} \left( \nabla f_i(x^{(k)}) - \alpha_i^{(k)} \right) + \sum_{i=1}^N \alpha_i^{(k)}. \tag{6}$$

Here, the $\alpha_i^{(k)}$'s are components in the snapshot set $A^{(k)}$.

(iii) Snapshot-Updating: update the historical snapshot set according to the input strategy.

We summarize AGLD in Algorithm 1. While our mixture-time analyses hold as long as the input Data-Accessing and Snapshot-Updating strategies meet Requirements 1 and 2, we describe in detail several typical qualified implementations of these two steps below.

3.1 The Data-Accessing Step

We make the following requirement on the Data-Accessing step to ensure the convergence of the $W_2$ distance between the sample distribution $p^{(k)}$ and the target distribution $p^*$.

Requirement 1. In every iteration, each point in the dataset has been visited at least once in the past $C$ iterations, where $C$ is some fixed positive constant.

We note that Requirement 1 is general and covers three commonly used data-accessing strategies: Random Access (RA), Random Reshuffle (RR), and Cyclic Access (CA).

RA: select $n$ indices uniformly from $[N]$ with replacement;
RR: select $n$ indices sequentially from $[N]$, with a new permutation drawn at the beginning of each data pass;
CA: select $n$ indices from $[N]$ in a cyclic order.

RA is widely used to construct unbiased gradient approximations in gradient-based Langevin dynamics methods, and it is amenable to theoretical analysis. However, in big-data scenarios where the dataset does not fit into memory, RA is not I/O friendly, since it entails heavy data exchange between memory and disk. On the contrary, the CA strategy significantly promotes spatial locality and therefore reduces the page-fault rate when handling huge datasets with limited memory (Xie et al. 2018). RR can be considered a trade-off between RA and CA. However, methods with either CA or RR are difficult to analyze, in that the gradient approximation is generally not an unbiased estimator of the true gradient (Shamir 2016). It can be verified that these strategies satisfy Requirement 1: for RR, in the $k$-th iteration, all the data points have been accessed in the past $2N/n$ iterations; for CA, all the data points have been accessed in the past $N/n$ iterations. A minimal sketch of these strategies, together with one Sample-Searching step, is given below.
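To make the three-step structure concrete, the following is a minimal Python sketch (our illustration, not the authors' released code) of the three Data-Accessing strategies as index streams and of one Sample-Searching step combining (5) and (6). The `(N, d)` snapshot array `A`, the helper `grad_f(i, x)` returning $\nabla f_i(x)$, and all names are illustrative assumptions.

```python
import numpy as np

# --- Data-Accessing strategies as index streams (RA / RR / CA) ---

def random_access(N, n, rng):
    """RA: draw n indices uniformly from [N] with replacement."""
    while True:
        yield rng.integers(0, N, size=n)

def random_reshuffle(N, n, rng):
    """RR: permute [N] at the start of each data pass, then read sequentially."""
    while True:
        perm = rng.permutation(N)
        for s in range(0, N, n):
            yield perm[s:s + n]

def cyclic_access(N, n):
    """CA: read n indices at a time in a fixed cyclic order (disk I/O friendly)."""
    s = 0
    while True:
        yield np.arange(s, s + n) % N
        s = (s + n) % N

# --- Sample-Searching: aggregated gradient (6) and Langevin step (5) ---

def aggregated_gradient(A, x, S_k, grad_f):
    """Eq. (6): g = (N/n) * sum_{i in S_k} (grad f_i(x) - alpha_i) + sum_i alpha_i."""
    N, n = A.shape[0], len(S_k)
    corr = sum(grad_f(i, x) - A[i] for i in S_k)
    return (N / n) * corr + A.sum(axis=0)

def sample_search(A, x, S_k, eta, grad_f, rng):
    """Eq. (5): one step along -g with injected Gaussian noise."""
    g = aggregated_gradient(A, x, S_k, grad_f)
    xi = rng.standard_normal(x.shape)
    return x - eta * g + np.sqrt(2.0 * eta) * xi
```

In an actual implementation one would keep the running total $\sum_i \alpha_i^{(k)}$ incrementally instead of re-summing `A`, so that the per-iteration cost stays proportional to the mini-batch size; the strategies then differ only in how $S_k$ is produced and in how `A` is refreshed, which is the subject of the next subsection.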
Note that RA satisfies Requirement 1 with $C = O(N \log N)$ w.h.p., according to the Coupon Collector Theorem (Dawkins 1991).

3.2 The Snapshot-Updating Step

The Snapshot-Updating step maintains a snapshot set $A^{(k)}$ such that in the $k$-th iteration, $A^{(k)}$ contains $N$ records $\alpha_i^{(k)}$ of $\nabla f_i(y_i^{(k)})$, where $y_i^{(k)}$ is some historical iterate $y_i^{(k)} = x^{(j)}$ with $j \le k$. Additionally, for our analyses to hold, the input strategy should satisfy the following requirement.

Requirement 2. The components in the gradient snapshot set $A^{(k)}$ should have been updated in the past $D$ iterations, i.e., $\alpha_i^{(k)} \in \{\nabla f_i(x^{(j)})\}_{j=k-D}^{k}$, where $D$ is a fixed constant.

This requirement guarantees that the $\alpha_i^{(k)}$'s are not far from the $\nabla f_i(x^{(k)})$'s and thus can be used to construct a proper approximation of $\nabla f(x^{(k)})$. The Snapshot-Updating step tries to strike a balance between approximation accuracy and computation cost: in each iteration, updating a larger portion of the $N$ entries in the snapshot set leads to a more accurate gradient approximation, at the cost of a higher computational burden. In the following, we list the three feasible Snapshot-Updating strategies considered in this paper: Periodic Total Update (PTU), Per-iteration Partial Update (PPU), and Time-based Mixture Update (TMU). A compact code sketch of all three strategies is given after Remark 1.

PTU: This strategy operates in an epoch-wise manner: at the beginning of each epoch, all the entries in the snapshot set are updated to the current component gradients, $\alpha_i^{(k)} = \nabla f_i(x^{(k)})$, and in the following $D - 1$ iterations the snapshot set remains unchanged (see Strategy 2). Such a synchronous update of the snapshot set allows us to implement PTU in a memory-efficient manner: in the $k$-th iteration, PTU only needs to store the iterate $\tilde{x} = x^{(k - \mathrm{mod}(k, D))}$ and its gradient $\nabla f(\tilde{x})$, as we can obtain the snapshot entry $\alpha_i^{(k)}$ via a simple evaluation of the corresponding component gradient at $\tilde{x}$ during the calculation of $g^{(k)}$. Therefore, the PTU strategy is preferable when storage is limited.

Strategy 2 PTU($A^{(k)}, x^{(k)}, x^{(k+1)}, k, S_k$)
for $i = 1$ to $N$ do
  $\alpha_i^{(k+1)} = \mathbb{I}\{\mathrm{mod}(k+1, D) = 0\}\,\nabla f_i(x^{(k+1)}) + \mathbb{I}\{\mathrm{mod}(k+1, D) \neq 0\}\,\alpha_i^{(k)}$
end for
return $A^{(k+1)}$

PPU: This strategy substitutes $\alpha_i^{(k)}$ with $\nabla f_i(x^{(k)})$ for $i \in S_k$ in the $k$-th iteration (see Strategy 3). This partial substitution strategy, together with Requirement 1, ensures Requirement 2. The downside of PPU is the extra $O(dN)$ memory used to keep the snapshot set $A^{(k)}$. Fortunately, in many applications of interest, $\nabla f_i(x)$ is actually the product of a scalar and the data point $z_i$, which implies that only $O(N)$ extra storage is needed to store $N$ scalars.

Strategy 3 PPU($A^{(k)}, x^{(k)}, k, S_k$)
for $i = 1$ to $N$ do
  $\alpha_i^{(k+1)} = \mathbb{I}\{i \in S_k\}\,\nabla f_i(x^{(k)}) + \mathbb{I}\{i \notin S_k\}\,\alpha_i^{(k)}$
end for
return $A^{(k+1)}$

TMU: This strategy updates the whole snapshot set once every $D$ iterations, and substitutes $\alpha_i^{(k)}$ with $\nabla f_i(x^{(k)})$ for $i \in S_k$ in the $k$-th iteration (see Strategy 4). TMU possesses the merits of both PPU and PTU: it updates the components of the gradient snapshot set indexed by $S_k$ to newly computed ones in each iteration as PPU does, and also periodically updates the whole snapshot set as PTU does, in case there exist indices unselected for a long time.

Remark 1. PPU is the Snapshot-Updating strategy used in SAGA-LD, and PTU is the strategy used in SVRG-LD (Dubey et al. 2016). To the best of our knowledge, TMU has never been proposed in the MCMC literature before. Note that the HSAG Snapshot-Updating strategy proposed by Reddi et al. (2015) also satisfies our requirement; we omit its discussion due to the limit of space.
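The three update rules above can be written compactly. The following is a minimal sketch under the same assumptions as the previous snippet (an `(N, d)` array `A` for the snapshot set, mutated in place, and an illustrative helper `grad_f(i, x)`); it mirrors Strategies 2-4 but is not the authors' code.

```python
import numpy as np

def ptu(A, x_next, k, D, grad_f):
    """Strategy 2 (PTU): refresh every entry when mod(k+1, D) == 0, else keep A."""
    if (k + 1) % D == 0:
        for i in range(A.shape[0]):
            A[i] = grad_f(i, x_next)
    return A

def ppu(A, x, S_k, grad_f):
    """Strategy 3 (PPU): overwrite only the entries visited in this iteration."""
    for i in S_k:
        A[i] = grad_f(i, x)
    return A

def tmu(A, x, x_next, k, S_k, D, grad_f):
    """Strategy 4 (TMU): PPU-style partial updates plus a PTU-style full
    refresh every D iterations, so no entry stays stale for long."""
    if (k + 1) % D == 0:
        for i in range(A.shape[0]):
            A[i] = grad_f(i, x_next)
    else:
        for i in S_k:
            A[i] = grad_f(i, x)
    return A
```

Combined with `sample_search` from the previous sketch, e.g. `tmu` together with `random_access` gives the TMU-RA variant discussed next.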
3.3 Derived Algorithms

By plugging the aforementioned Data-Accessing and Snapshot-Updating strategies into AGLD, we derive several practical algorithms. We name the algorithms by "Snapshot-Updating - Data-Accessing"; e.g., TMU-RA uses TMU as the Snapshot-Updating strategy and RA as the Data-Accessing strategy. Note that we recover SAGA-LD and SVRG-LD with PPU-RA and PTU-RA, respectively. In the following section, we provide unified analyses for all derived algorithms under different regularity conditions. We emphasize that, in the absence of unbiasedness of the gradient approximation, our mixture-time analyses are the first to cover algorithms with the I/O-friendly cyclic data-accessing scheme.

Strategy 4 TMU($A^{(k)}, x^{(k)}, k, S_k$)
for $i = 1$ to $N$ do
  if $\mathrm{mod}(k+1, D) = 0$ then
    $\alpha_i^{(k+1)} = \nabla f_i(x^{(k+1)})$
  else
    $\alpha_i^{(k+1)} = \mathbb{I}\{i \in S_k\}\,\nabla f_i(x^{(k)}) + \mathbb{I}\{i \notin S_k\}\,\alpha_i^{(k)}$
  end if
end for
return $A^{(k+1)}$

4 Theoretical Analysis

In this section, we provide the mixture-time analysis for AGLD. The detailed proofs of the theorems are postponed to the long version of this paper due to the limit of space.

4.1 Analysis for AGLD with strongly convex f(x)

We first investigate the $W_2$ distance between the sample distribution $p^{(k)}$ of the iterate $x^{(k)}$ and the target distribution $p^*$ under smoothness and strong convexity assumptions.

Assumption 1 (Smoothness). Each individual $f_i$ is $M$-smooth. That is, $f_i$ is twice differentiable and there exists a constant $M > 0$ such that for all $x, y \in \mathbb{R}^d$,

$$f_i(y) \le f_i(x) + \langle \nabla f_i(x), y - x \rangle + \frac{M}{2}\|x - y\|_2^2. \tag{7}$$

Accordingly, we can verify that the sum $f$ of the $f_i$'s is $\bar{M}$-smooth with $\bar{M} = MN$.

Assumption 2 (Strong Convexity). The sum $f$ is $\mu$-strongly convex. That is, there exists a constant $\mu > 0$ such that for all $x, y \in \mathbb{R}^d$,

$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|x - y\|_2^2. \tag{8}$$

Note that these assumptions are satisfied by many Bayesian sampling models, such as Bayesian ridge regression, Bayesian logistic regression, and Bayesian Independent Component Analysis, and they are used in many existing analyses of Langevin dynamics based MCMC methods (Dalalyan 2017b; Baker et al. 2017; Zou, Xu, and Gu 2018b; Chatterji et al. 2018).

Theorem 1. Under Assumptions 1 and 2 and Requirement 2, AGLD outputs a sample $x^{(k)}$ whose distribution $p^{(k)}$ satisfies $W_2(p^{(k)}, p^*) \le \epsilon$ for any $k \ge K = O(\epsilon^{-2})$ with $\eta = O(\epsilon^2)$.

Remark 2. Under these assumptions, the $\epsilon$-mixture time $K$ of AGLD has the same dependency on $\epsilon$ as that of LMC (Dalalyan 2017b). Note that we hide the dependency on other regularity parameters such as $\mu$, $L$, and $N$ in the $O(\cdot)$ for simplicity. Actually, AGLD methods with CA/RR have a worse dependency on these parameters than algorithms with RA. However, when the dataset does not fit into memory, the sequential data-accessing nature of CA incurs less I/O cost than random data accessing, which gives CA based AGLD methods better time efficiency than the RA based ones.

The bound on the mixture time of AGLD with RA can be improved under the Lipschitz-continuous Hessian condition.

Assumption 3 (Lipschitz-continuous Hessian). There exists a constant $L > 0$ such that for all $x, y \in \mathbb{R}^d$, $\|\nabla^2 f(x) - \nabla^2 f(y)\| \le L\|x - y\|_2$.

Theorem 2. Under Assumptions 1, 2, and 3 and Requirement 2, AGLD methods with RA output a sample $x^{(k)}$ whose distribution $p^{(k)}$ satisfies $W_2(p^{(k)}, p^*) \le \epsilon$ for any $k \ge K = O(\log(1/\epsilon)/\epsilon)$ by setting $\eta = O(\epsilon)$.

Additionally, when we adopt the random data-accessing scheme, the mixture time of the newly proposed TMU-RA method can be written in a more concrete form, which is established in the following theorem.

Theorem 3. Suppose Assumptions 1, 2, and 3 hold, and denote $\kappa = M/\mu$.
Then TMU-RA outputs a sample $x^{(k)}$ whose distribution $p^{(k)}$ satisfies $W_2(p^{(k)}, p^*) \le \epsilon$ for any $k \ge K = O(\kappa^{3/2}\sqrt{d}/(n\epsilon))$, if we set $\eta < \epsilon n \sqrt{\mu}/(M\sqrt{d}N)$, $n \ge 9$, and $D = N$.

Remark 3. Note that the component gradient complexity to achieve $W_2(p^{(k)}, p^*) \le \epsilon$ with TMU-RA is $T_g = O(N + \kappa^{3/2}\sqrt{d}/\epsilon)$, which is the same as those of SAGA-LD (Chatterji et al. 2018) and SVRG-LD (Zou, Xu, and Gu 2018b). In practice, in our experiments, TMU based variants always have better empirical performance than their PPU based and PTU based counterparts, as the entries in the snapshot set maintained by TMU are more up-to-date.

4.2 Extension to general convex f(x)

Following an idea similar to that of (Zou, Xu, and Gu 2018a), we can extend AGLD to drawing samples from densities with general convex $f(x)$. First, we construct the following strongly convex approximation $\hat{f}(x)$ of $f(x)$:

$$\hat{f}(x) = f(x) + \lambda \|x\|_2^2 / 2.$$

Then, we run AGLD with $\hat{f}(x)$ to generate samples until the sample distribution $p^{(K)}$ satisfies $W_2(p^{(K)}, \hat{p}^*) \le \epsilon/2$, where $\hat{p}^* \propto e^{-\hat{f}(x)}$ denotes the stationary distribution of the Langevin dynamics with drift term $\hat{f}$ (see (2) for the definition). If we choose a proper $\lambda$ such that $W_2(\hat{p}^*, p^*) \le \epsilon/2$, then by the triangle inequality of the $W_2$ distance, we have

$$W_2(p^{(K)}, p^*) \le W_2(p^{(K)}, \hat{p}^*) + W_2(\hat{p}^*, p^*) \le \epsilon.$$

Thus, we have the following theorem.

Theorem 4. Suppose the assumptions in Theorem 1 hold, and further assume that the target distribution $p^* \propto e^{-f}$ has bounded fourth-order moment, i.e., $\mathbb{E}_{p^*}[\|x\|_2^4] \le \hat{U} d^2$. If we choose $\lambda = 4\epsilon^2/(\hat{U} d^2)$ and run the AGLD algorithm with $\hat{f}(x) = f(x) + \lambda \|x\|_2^2/2$, we have $W_2(p^{(k)}, p^*) \le \epsilon$ for any $k \ge K = O(\epsilon^{-8})$. If we further assume that $f$ has a Lipschitz-continuous Hessian, then SVRG-LD, SAGA-LD, and TMU-RA can achieve $W_2(p^{(K)}, p^*) \le \epsilon$ in $K = O(\epsilon^{-3})$ iterations.

4.3 Theoretical results for nonconvex f(x)

In this subsection, we characterize the $\epsilon$-mixture time of AGLD for sampling from densities with nonconvex $f(x)$. The following assumption is necessary for our theory.

Assumption 4 (Dissipativity). There exist constants $a, b > 0$ such that for all $x \in \mathbb{R}^d$, the sum $f$ satisfies $\langle \nabla f(x), x \rangle \ge b\|x\|_2^2 - a$.

This assumption is typical in the ergodicity analysis of stochastic differential equations and diffusion approximations. It indicates that, starting from a position sufficiently far from the origin, the Langevin dynamics (2) moves towards the origin on average. With this assumption, we establish the following theorem on the nonasymptotic convergence of AGLD for nonconvex $f(x)$.

Theorem 5. Under Assumptions 1 and 4 and Requirement 2, AGLD outputs a sample $x^{(k)}$ with distribution $p^{(k)}$ satisfying $W_2(p^{(k)}, p^*) \le \epsilon$ for any $k \ge K = O(\epsilon^{-4})$ with $\eta = O(\epsilon^4)$.

Remark 4. This $O(\epsilon^{-4})$ result is similar to the bound for LMC sampling from densities with nonconvex $f(x)$ (Raginsky, Rakhlin, and Telgarsky 2017). Note that, as pointed out by Raginsky, Rakhlin, and Telgarsky (2017), vanilla SGLD fails to converge in this setting.

5 Related Work

In this section, we briefly review the literature on Langevin dynamics based MCMC algorithms. By directly discretizing the Langevin dynamics (2), Roberts, Tweedie, and others (1996) proposed to use LMC (3) to generate samples from the target distribution. The first nonasymptotic analysis of LMC was established by Dalalyan (2017b), which analyzed the error of approximating a target distribution with strongly convex $f(x)$ in the total variation distance. This result was soon improved by Durmus, Moulines, and others (2017). Later, Durmus and Moulines (2016) and Cheng and Bartlett (2018) established the convergence of LMC in the 2-Wasserstein distance and KL-divergence, respectively.
While the former works focus on sampling from distributions with (strongly) convex $f(x)$, Raginsky, Rakhlin, and Telgarsky (2017) investigated the nonasymptotic convergence of LMC in the 2-Wasserstein distance when $f(x)$ is nonconvex. With the increasing data sizes in modern machine learning tasks, the SGLD method (Welling and Teh 2011), which replaces the full gradient in LMC with a stochastic gradient (Robbins and Monro 1951), has received much attention. Vollmer, Zygalakis, and others (2015) analyzed the nonasymptotic bias and variance of SGLD using Poisson equations, and Dalalyan and Karagulyan (2017) proved the convergence of SGLD in the 2-Wasserstein distance when the target distribution is strongly log-concave. Despite the great success of SGLD, the large variance of stochastic gradients may lead to unavoidable bias (Baker et al. 2017; Betancourt 2015; Brosse, Durmus, and Moulines 2018). To overcome this, Teh, Thiery, and Vollmer (2016) proposed to decrease the step size to alleviate the bias, and proved the asymptotic rate of SGLD in terms of mean squared error (MSE). Dang et al. (2019) utilized an approximate MH correction step, which only uses part of the whole dataset, to decrease the influence of the variance. Another way to reduce the variance of stochastic gradients and save gradient computation is to apply variance-reduction techniques. Dubey et al. (2016) used two different variance-reduced gradient estimators of $\nabla f(x)$, which utilize the component gradient information of past samples, and devised the SVRG-LD and SAGA-LD algorithms. They proved that these two algorithms improve the MSE upon SGLD. Chatterji et al. (2018) and Zou, Xu, and Gu (2019) studied the nonasymptotic convergence of these methods in the 2-Wasserstein distance when sampling from densities with strongly convex and nonconvex $f(x)$, respectively. Their results show that SVRG-LD and SAGA-LD can achieve an $\epsilon$-mixture-time bound similar to LMC's w.r.t. $\epsilon$, while their per-iteration computational cost is similar to that of SGLD. There is another research line which uses the mode of the log-posterior to construct control-variate estimates of the full gradient (Baker et al. 2017; Bierkens, Fearnhead, and Roberts 2016; Nagapetyan et al. 2017; Chatterji et al. 2018; Brosse, Durmus, and Moulines 2018). However, calculating the mode is intractable for large-scale problems, rendering these methods impractical for real-world Bayesian learning tasks.

6 Experiments

We follow the experimental settings in the literature (Zou, Xu, and Gu 2018b; Dubey et al. 2016; Chatterji et al. 2018; Welling and Teh 2011; Zou, Xu, and Gu 2019) and conduct empirical studies on two simulated experiments (sampling from distributions with convex and nonconvex $f$, respectively) and two real-world applications (Bayesian logistic regression and Bayesian ridge regression). Nine instances of AGLD are considered, including SVRG-LD (PTU-RA), PTU-RR, PTU-CA, SAGA-LD (PPU-RA), PPU-RR, PPU-CA, TMU-RA, TMU-RR, and TMU-CA. We also include LMC, SGLD, SVRG-LD+ (Zou, Xu, and Gu 2018b), SVRG-RR+, and SVRG-CA+ as baselines, where SVRG-RR+ is the random reshuffle variant of SVRG-LD+ and SVRG-CA+ is its cyclic access variant. Due to the limit of space, we put the experiment on sampling from a distribution with convex $f$ in our long version.

Table 1: Statistics of the datasets used in our experiments.

Dataset              Dimension     Data size
YearPredictionMSD    90            515,345
SliceLocation        384           53,500
criteo               999,999       45,840,617
kdd12                54,686,452    149,639,105
The statistics of the datasets are listed in Table 1.

6.1 Sampling from a Gaussian Mixture Distribution

In this simulated experiment, we consider sampling from the distribution $p^* \propto \exp(-f(x)) = \exp\big(-\sum_{i=1}^N f_i(x)/N\big)$, where each component $\exp(-f_i(x))$ is defined as

$$\exp(-f_i(x)) = e^{-\|x - a_i\|_2^2/2} + e^{-\|x + a_i\|_2^2/2}, \quad a_i \in \mathbb{R}^d.$$

It can be verified that $\exp(-f_i(x))$ is proportional to the PDF of a Gaussian mixture distribution. According to (Dalalyan 2017b), when the parameter $a_i$ is chosen such that $\|a_i\|_2 > 1$, $f_i(x)$ is nonconvex. We set the sample size $N = 500$ and the dimension $d = 10$, and randomly generate parameters $a_i \sim N(\mu, \Sigma)$ with $\mu = (2, \dots, 2)^T$ and $\Sigma = I_{d \times d}$. In this experiment, we fix the Data-Accessing strategy of AGLD to RA and compare the performance of the LMC, SGLD, SVRG-LD, SAGA-LD, and TMU-RA algorithms. We run all algorithms for $2 \times 10^4$ data passes and use the iterates from the last $10^4$ data passes to visualize the distributions.

Figure 1: Gaussian Mixture Model. The red line denotes the projection of the target distribution $p^*$.

In Figure 1, we report the 2D projections of the densities of random samples generated by each algorithm. It can be observed that all three AGLD methods, i.e., SVRG-LD, SAGA-LD, and TMU-RA, approximate the target distribution well within $2 \times 10^4$ data passes, while the distributions generated by LMC and SGLD deviate noticeably from the true one. Moreover, the results show that, among the three AGLD methods, the sample distribution of TMU-RA approximates the target distribution best.

6.2 Bayesian Ridge Regression

Bayesian ridge regression aims to predict the response $y$ from the covariate $x$, given a dataset $Z = \{x_i, y_i\}_{i=1}^N$. The response $y$ is modeled as a random variable sampled from a conditional Gaussian distribution $p(y \mid x, w) = N(w^T x, \lambda)$, where $w$ denotes the weight variable and has a Gaussian prior $p(w) = N(0, \lambda I_{d \times d})$. By the Bayes rule, one can infer $w$ from the posterior $p(w \mid Z)$ and use it to make predictions. Two publicly available benchmark datasets are used for evaluation: YearPredictionMSD and SliceLocation (https://archive.ics.uci.edu/ml/index.php). In this task, we fix the Data-Accessing strategy to RA and compare the performance of different Snapshot-Updating strategies. To better understand the newly proposed TMU Snapshot-Updating strategy, we also investigate the performance of TMU-type methods with different Data-Accessing strategies.

Figure 2: Bayesian Ridge Regression (test MSE versus epochs on YearPredictionMSD; the panels compare methods under RA, RR, and CA, plus the TMU family across Data-Accessing strategies).

Figure 3: Bayesian Logistic Regression (average test log-likelihood versus wall-clock time on criteo and kdd12, under 16 GB and 8 GB memory budgets).

By randomly partitioning the dataset into training (4/5) and testing (1/5) sets, we report the test mean squared error (MSE) of the compared methods on YearPredictionMSD in Fig. 2. A minimal sketch of the component gradients used in this task is given below.
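For concreteness, here is a minimal sketch (our illustration, not the paper's code) of a component gradient $\nabla f_i$ for this Bayesian ridge model, taking $f_i(w) = -\log p(y_i \mid x_i, w) - \log p(w)/N$ with $\lambda$ as in the text; the returned closure can be plugged into the AGLD sketches above as `grad_f`.

```python
import numpy as np

def make_ridge_grad(X, y, lam):
    """Component gradient for Bayesian ridge regression, assuming
    f_i(w) = (y_i - w^T x_i)^2 / (2 lam) + ||w||_2^2 / (2 N lam) + const."""
    N = X.shape[0]

    def grad_f(i, w):
        residual = y[i] - X[i] @ w          # scalar residual y_i - w^T x_i
        return -residual * X[i] / lam + w / (N * lam)

    return grad_f
```

Note that the data-dependent part of this gradient is a scalar multiple of $x_i$, which is exactly the structure behind the $O(N)$-scalar storage remark for PPU in Section 3.2.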
The results for SliceLocation are similar to those for YearPredictionMSD, and are postponed to the Appendix due to the limit of space. We use the number of effective passes (epochs) over the dataset as the x-axis, which is proportional to the CPU time. From the first three columns of the figure, we can see that (i) TMU-type methods have the best performance among all the methods with the same Data-Accessing strategy, and (ii) SVRG+ and PPU type methods consistently outperform LMC, SGLD, and PTU type methods. These results validate the advantage of the TMU strategy over PPU and PTU. The last column of Figure 2 shows that TMU-RA outperforms TMU-CA/TMU-RR when the dataset fits in memory. These results imply that TMU-RA is the best choice if we have enough memory.

6.3 Bayesian Logistic Regression

Bayesian logistic regression is a robust binary classification method. Let $Z = \{x_i, y_i\}_{i=1}^N$ be a dataset with $y_i \in \{-1, 1\}$ denoting the sample label and $x_i \in \mathbb{R}^d$ denoting the sample covariate vector. The conditional distribution of the label is modeled by $p(y_i \mid x_i, w) = \phi(y_i w^T x_i)$, where $\phi(\cdot)$ is the sigmoid function and the prior of $w$ is $p(w) = N(0, \lambda I_{d \times d})$. We focus on the big-data setting, where the physical memory is insufficient to load the entire dataset. Specifically, two large-scale datasets, criteo (27.32 GB) and kdd12 (26.76 GB), are used (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets), and we manually restrict the available physical memory to 16 GB and 8 GB for simulation. We demonstrate that the CA strategy is advantageous in this setting by comparing six AGLD methods with either CA or RA, namely SVRG-LD, PTU-CA, SAGA-LD, PPU-CA, TMU-RA, and TMU-CA. We also include LMC, SGLD, SVRG-LD+, and SVRG-CA+ as baselines. Methods with the RR strategy have almost identical performance to their RA counterparts and are hence omitted.

The average test log-likelihood versus execution time is reported in Fig. 3. The empirical results show that methods with CA outperform their RA counterparts. As the amount of physical memory gets smaller (from 16 GB to 8 GB), the time efficiency of CA becomes more apparent. The results also show that TMU performs better than the other Snapshot-Updating strategies under the same Data-Accessing strategy.

7 Conclusion and Future Work

In this paper, we proposed a general framework called Aggregated Gradient Langevin Dynamics (AGLD) for Bayesian posterior sampling. A unified analysis of AGLD is provided, without the need to design different Lyapunov functions for different methods individually. In particular, we establish the first theoretical guarantees for cyclic access and random reshuffle based methods. By introducing the new Snapshot-Updating strategy TMU, we derive several new methods under AGLD. Empirical results validate the efficiency and effectiveness of the proposed TMU in both simulated and real-world tasks. The theoretical analysis and empirical results indicate that TMU-RA is the best choice when memory is sufficient, and TMU-CA should be used otherwise.

Acknowledgements

This work is supported by the Zhejiang Provincial Natural Science Foundation of China under Grant No. LZ18F020002, and the National Natural Science Foundation of China (Grant Nos. 61672376, 61751209, and 61472347).

References

Baker, J.; Fearnhead, P.; Fox, E. B.; and Nemeth, C. 2017. Control variates for stochastic gradient MCMC. arXiv preprint arXiv:1706.05439.

Betancourt, M. 2015. The fundamental incompatibility of scalable Hamiltonian Monte Carlo and naive data subsampling.
In International Conference on Machine Learning, 533-540.

Bierkens, J.; Fearnhead, P.; and Roberts, G. 2016. The zig-zag process and super-efficient sampling for Bayesian analysis of big data. arXiv preprint arXiv:1607.03188.

Brosse, N.; Durmus, A.; and Moulines, E. 2018. The promises and pitfalls of stochastic gradient Langevin dynamics. In Advances in Neural Information Processing Systems, 8268-8278.

Chatterji, N. S.; Flammarion, N.; Ma, Y.-A.; Bartlett, P. L.; and Jordan, M. I. 2018. On the theory of variance reduction for stochastic gradient Monte Carlo. arXiv preprint arXiv:1802.05431.

Cheng, X., and Bartlett, P. L. 2018. Convergence of Langevin MCMC in KL-divergence. PMLR 83:186-211.

Dalalyan, A. S., and Karagulyan, A. G. 2017. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. arXiv preprint arXiv:1710.00095.

Dalalyan, A. S. 2017a. Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. arXiv preprint arXiv:1704.04752.

Dalalyan, A. S. 2017b. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79(3):651-676.

Dang, K.-D.; Quiroz, M.; Kohn, R.; Tran, M.-N.; and Villani, M. 2019. Hamiltonian Monte Carlo with energy conserving subsampling. Journal of Machine Learning Research 20(100):1-31.

Dawkins, B. 1991. Siobhan's problem: The coupon collector revisited. The American Statistician 45(1):76-82.

Defazio, A.; Bach, F.; and Lacoste-Julien, S. 2014. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, 1646-1654.

Dubey, K. A.; Reddi, S. J.; Williamson, S. A.; Poczos, B.; Smola, A. J.; and Xing, E. P. 2016. Variance reduction in stochastic gradient Langevin dynamics. In Advances in Neural Information Processing Systems, 1154-1162.

Durmus, A., and Moulines, E. 2016. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. arXiv preprint arXiv:1605.01559.

Durmus, A.; Moulines, E.; et al. 2017. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability 27(3):1551-1587.

Johnson, R., and Zhang, T. 2013. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, 315-323.

Lei, L., and Jordan, M. 2017. Less than a single pass: Stochastically controlled stochastic gradient. In Artificial Intelligence and Statistics, 148-156.

Mattingly, J. C.; Stuart, A. M.; and Higham, D. J. 2002. Ergodicity for SDEs and approximations: Locally Lipschitz vector fields and degenerate noise. Stochastic Processes and their Applications 101(2):185-232.

Nagapetyan, T.; Duncan, A. B.; Hasenclever, L.; Vollmer, S. J.; Szpruch, L.; and Zygalakis, K. 2017. The true cost of stochastic gradient Langevin dynamics. arXiv preprint arXiv:1706.02692.

Parisi, G. 1981. Correlation functions and computer simulations. Nuclear Physics B 180(3):378-384.

Raginsky, M.; Rakhlin, A.; and Telgarsky, M. 2017. Non-convex learning via stochastic gradient Langevin dynamics: A nonasymptotic analysis. arXiv preprint arXiv:1702.03849.

Reddi, S. J.; Hefny, A.; Sra, S.; Poczos, B.; and Smola, A. J. 2015. On variance reduction in stochastic gradient descent and its asynchronous variants. In NIPS, 2647-2655.

Robbins, H., and Monro, S. 1951. A stochastic approximation method. The Annals of Mathematical Statistics 400-407.

Roberts, G. O., and Stramer, O. 2002.
Langevin diffusions and Metropolis-Hastings algorithms. Methodology and Computing in Applied Probability 4(4):337-357.

Roberts, G. O.; Tweedie, R. L.; et al. 1996. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 2(4):341-363.

Shamir, O. 2016. Without-replacement sampling for stochastic gradient methods. In NeurIPS, 46-54.

Teh, Y. W.; Thiery, A. H.; and Vollmer, S. J. 2016. Consistency and fluctuations for stochastic gradient Langevin dynamics. Journal of Machine Learning Research 17:1-33.

Vollmer, S. J.; Zygalakis, K. C.; et al. 2015. (Non-)asymptotic properties of stochastic gradient Langevin dynamics. arXiv preprint arXiv:1501.00438.

Welling, M., and Teh, Y. W. 2011. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 681-688.

Xie, J.; Qian, H.; Shen, Z.; and Zhang, C. 2018. Towards memory-friendly deterministic incremental gradient method. In International Conference on Artificial Intelligence and Statistics, 1147-1156.

Zou, D.; Xu, P.; and Gu, Q. 2018a. Stochastic variance-reduced Hamilton Monte Carlo methods. arXiv preprint arXiv:1802.04791.

Zou, D.; Xu, P.; and Gu, Q. 2018b. Subsampled stochastic variance-reduced gradient Langevin dynamics. In UAI.

Zou, D.; Xu, P.; and Gu, Q. 2019. Sampling from non-log-concave distributions via variance-reduced gradient Langevin dynamics. In AISTATS, 2936-2945.