# Variance Reduction in Stochastic Particle-Optimization Sampling

Jianyi Zhang¹ Yang Zhao² Ruiyi Zhang¹ Lawrence Carin¹ Changyou Chen²

¹Duke University ²University at Buffalo, SUNY. Correspondence to: Lawrence Carin, Changyou Chen.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract. Stochastic particle-optimization sampling (SPOS) is a recently developed scalable Bayesian sampling framework that unifies stochastic gradient MCMC (SG-MCMC) and Stein variational gradient descent (SVGD) algorithms, based on Wasserstein gradient flows. With a rigorous non-asymptotic convergence theory developed, SPOS avoids the particle-collapsing pitfall of SVGD. However, the variance-reduction effect in SPOS has not been studied. In this paper, we address this gap by presenting several variance-reduction techniques for SPOS. Specifically, we propose three variants of variance-reduced SPOS, called SAGA particle-optimization sampling (SAGA-POS), SVRG particle-optimization sampling (SVRG-POS), and a variant of SVRG-POS that avoids full-gradient computations, denoted SVRG-POS+. Importantly, we provide non-asymptotic convergence guarantees for these algorithms in terms of the 2-Wasserstein metric and analyze their complexities. The results show that our algorithms yield better convergence rates than existing variance-reduced variants of stochastic gradient Langevin dynamics, though more space is required to store the particles during training. Our theory aligns well with experimental results on both synthetic and real datasets.

1. Introduction

Sampling has been an effective tool for approximate Bayesian inference, which is becoming increasingly important in modern machine learning. In the big-data setting, recent research has developed scalable Bayesian sampling algorithms such as stochastic gradient Markov chain Monte Carlo (SG-MCMC) (Welling & Teh, 2011) and Stein variational gradient descent (SVGD) (Liu & Wang, 2016). These methods have facilitated important real-world applications and achieved impressive results, for example in topic modeling (Gan et al., 2015; Liu et al., 2016), matrix factorization (Chen et al., 2014; Ding et al., 2014; Şimşekli et al., 2016), differential privacy (Wang et al., 2015; Li et al., 2017), Bayesian optimization (Springenberg et al., 2016), reinforcement learning (Haarnoja et al., 2018; Zhang et al., 2018a;b; 2019) and deep neural networks (Li et al., 2016). Generally speaking, these methods use gradient information of a target distribution to generate samples, leading to more effective algorithms than traditional sampling methods.

Recently, Chen et al. (2018) proposed a particle-optimization Bayesian sampling framework based on Wasserstein gradient flows, which unified SG-MCMC and SVGD in a new sampling framework called particle-optimization sampling (POS). Furthermore, Zhang et al. (2020) discovered that SVGD suffers from an unintended pitfall: particles tend to collapse under some conditions. As a remedy, they proposed injecting random noise into the SVGD update equations within the POS framework, leading to stochastic particle-optimization sampling (SPOS) algorithms (Zhang et al., 2020). Remarkably, for the first time, a non-asymptotic convergence theory was developed for SPOS (an SVGD-type algorithm) in (Zhang et al., 2020).
In order to deal with large-scale datasets, many gradient-based methods for optimization and sampling use stochastic gradients calculated on a mini-batch of the dataset, for computational feasibility. Unfortunately, this has the potential of adding extra variance to the algorithms, which may degrade model performance. To address this issue, variance control has been an important and interesting direction of research. Efficient solutions such as SAGA (Defazio et al., 2014) and SVRG (Johnson & Zhang, 2013) were proposed to reduce variance in stochastic optimization. Subsequently, Dubey et al. (2016) introduced these techniques into SG-MCMC for Bayesian sampling, where they have also achieved great success in practice. Since SPOS enjoys the best of both worlds by combining SG-MCMC and SVGD, it is of great value to further reduce its gradient variance. While both the algorithm and the theory have been developed for SPOS, no work has investigated variance-reduction techniques for it. Compared with SG-MCMC, where variance reduction has been well explored by recent work such as (Dubey et al., 2016; Chatterji et al., 2018; Zou et al., 2018; Chen et al., 2019; Zou et al., 2019), controlling the variance of stochastic gradients in SPOS is much more challenging. From a theoretical perspective, SPOS corresponds to nonlinear stochastic differential equations (SDEs), for which fewer existing mathematical tools are available for analysis. Furthermore, the fact that many interacting particles are used in the algorithm makes it difficult to improve performance by modifying the way they interact with each other.

In this paper, we take a first attempt to study variance-reduction techniques in SPOS and develop the corresponding convergence theory. We adopt recent ideas on variance reduction in SG-MCMC and stochastic-optimization algorithms, and propose three variance-reduced SPOS algorithms, denoted SAGA particle-optimization sampling (SAGA-POS), SVRG particle-optimization sampling (SVRG-POS), and a variant of SVRG-POS without full-gradient computations, denoted SVRG-POS+. For all these variants, we prove rigorous theoretical results on their non-asymptotic convergence rates in terms of the 2-Wasserstein metric. Importantly, our theoretical results demonstrate significant improvements in convergence rates over standard SPOS. Remarkably, when comparing our convergence rates with those of variance-reduced stochastic gradient Langevin dynamics (SGLD), our theory indicates faster convergence of variance-reduced SPOS when the number of particles is large enough. Our theoretical results are verified by a number of experiments on both synthetic and real datasets.

2. Preliminaries

2.1. Stochastic gradient MCMC

In Bayesian sampling, one aims to sample from a posterior distribution $p(\theta|\mathbf{X}) \propto p(\mathbf{X}|\theta)p(\theta)$, where $\theta \in \mathbb{R}^d$ represents the model parameter and $\mathbf{X} \triangleq \{x_j\}_{j=1}^N$ is the dataset. Let $p(\theta|\mathbf{X}) = (1/Z)\exp(-U(\theta))$, where
$$U(\theta) = -\log p(\mathbf{X}|\theta) - \log p(\theta) = -\sum_{j=1}^N \log p(x_j|\theta) - \log p(\theta)$$
is referred to as the potential energy function, and $Z$ is the normalizing constant. We further define the full gradient $F$ and the individual gradients $F_j$ used throughout this paper:
$$F_j(\theta) \triangleq -\nabla_\theta \log p(x_j|\theta) - \frac{1}{N}\nabla_\theta \log p(\theta), \qquad F(\theta) \triangleq \nabla_\theta U(\theta) = \sum_{j=1}^N F_j(\theta).$$
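To make these definitions concrete, the following is a minimal NumPy sketch (ours, not from the paper) of $F_j$, $F$, and the minibatch estimator used in later sections; `grad_log_lik` and `grad_log_prior` are hypothetical callables standing in for the model-specific gradients.

```python
import numpy as np

def F_j(theta, x_j, grad_log_lik, grad_log_prior, N):
    """Individual gradient: F_j(theta) = -grad log p(x_j|theta) - (1/N) grad log p(theta)."""
    return -grad_log_lik(theta, x_j) - grad_log_prior(theta) / N

def F(theta, X, grad_log_lik, grad_log_prior):
    """Full gradient: F(theta) = grad U(theta) = sum_j F_j(theta)."""
    N = len(X)
    return sum(F_j(theta, x, grad_log_lik, grad_log_prior, N) for x in X)

def G(theta, X, B, grad_log_lik, grad_log_prior, rng):
    """Unbiased minibatch estimate of F(theta): (N/B) * sum over a random subset I_k."""
    N = len(X)
    I_k = rng.choice(N, size=B, replace=True)  # minibatch indices, |I_k| = B
    return (N / B) * sum(F_j(theta, X[j], grad_log_lik, grad_log_prior, N) for j in I_k)
```

Note that $\frac{N}{B}\sum_{j \in I_k} F_j(\theta) = -\nabla_\theta \log p(\theta) - \frac{N}{B}\sum_{j \in I_k} \nabla_\theta \log p(x_j|\theta)$, so this estimator coincides with the stochastic gradient $G_k$ introduced next.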
We can define a stochastic differential equation, an instance of Itô diffusion, whose stationary distribution equals the target posterior distribution $p(\theta|\mathbf{X})$. For example, consider the following 1st-order Langevin dynamics:
$$\mathrm{d}\theta_t = -\beta^{-1} F(\theta_t)\,\mathrm{d}t + \sqrt{2\beta^{-1}}\,\mathrm{d}\mathcal{W}_t, \qquad (1)$$
where $t$ is the time index, $\mathcal{W}_t \in \mathbb{R}^d$ is $d$-dimensional Brownian motion, and $\beta$ is a scaling factor. By the Fokker-Planck equation (Kolmogoroff, 1931; Risken, 1989), the stationary distribution of (1) equals $p(\theta|\mathbf{X})$.

SG-MCMC algorithms are discretized numerical approximations of Itô diffusions such as (1). To make the algorithms feasible in a big-data setting, the computationally expensive term $F$ is replaced in each iteration with an unbiased stochastic approximation computed on a random subset of the dataset; e.g., $F$ can be approximated by the stochastic gradient
$$G_k = -\nabla_\theta \log p(\theta_k) - \frac{N}{B}\sum_{j \in I_k} \nabla_{\theta_k} \log p(x_j|\theta_k),$$
where $I_k$ is a random subset of $\{1, 2, \dots, N\}$ of size $B$. The above definition of $G_k$ reflects the fact that we only have information from $B \ll N$ data points in each iteration. This is the source of the variance we seek to reduce. We also note that $G_k$ is used in standard SVGD and SPOS. As an example, SGLD is a numerical solution of (1), with update equation
$$\theta_{k+1} = \theta_k - \beta^{-1} G_k h + \sqrt{2\beta^{-1} h}\,\xi_k,$$
where $h$ is the step size and $\xi_k \sim \mathcal{N}(0, \mathbf{I})$.

2.2. Stein variational gradient descent

Different from SG-MCMC, SVGD initializes a set of particles, which are updated iteratively to approximate a posterior distribution. Specifically, consider a set of particles $\{\theta^{(i)}\}_{i=1}^M$ drawn from some distribution $q$. SVGD updates these particles by performing gradient descent on the interacting particle system via
$$\theta^{(i)} \leftarrow \theta^{(i)} + h\,\phi(\theta^{(i)}), \qquad \phi = \arg\max_{\phi \in \mathcal{F}} \big\{-\partial_h\, \mathrm{KL}(q_{[h\phi]}\,\|\,p)\big|_{h=0}\big\},$$
where $\phi$ is a perturbation direction chosen to maximally decrease the KL divergence between the updated density $q_{[h\phi]}$ induced by the particles and the posterior $p(\theta|\mathbf{X})$. The standard SVGD algorithm takes $\mathcal{F}$ to be the unit ball of a vector-valued reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ associated with a kernel $\kappa(\theta, \theta')$. In this setting, Liu & Wang (2016) show that
$$\phi(\theta) = \mathbb{E}_{\theta' \sim q}\big[-\kappa(\theta, \theta') F(\theta') + \nabla_{\theta'} \kappa(\theta, \theta')\big]. \qquad (2)$$
When approximating the expectation $\mathbb{E}_{\theta' \sim q}[\cdot]$ with the empirical distribution formed by the particles $\{\theta^{(i)}\}_{i=1}^M$ and adopting stochastic gradients $G_k^{(i)} = \frac{N}{B}\sum_{j \in I_k} F_j(\theta_k^{(i)})$, we arrive at the following update for the particles:
$$\theta_{k+1}^{(i)} = \theta_k^{(i)} + \frac{h}{M}\sum_{q=1}^M \big[-\kappa(\theta_k^{(q)}, \theta_k^{(i)})\,G_k^{(q)} + \nabla_{\theta_k^{(q)}} \kappa(\theta_k^{(q)}, \theta_k^{(i)})\big]. \qquad (3)$$
SVGD applies (3) repeatedly for all the particles.

2.3. Stochastic particle-optimization sampling

In this paper, we focus on the RBF kernel $\kappa(\theta, \theta') = \exp\big(-\frac{\|\theta - \theta'\|^2}{2\eta^2}\big)$ due to its wide use in both theoretical analysis and practical applications. Hence, we can rewrite the kernel $\kappa(\theta, \theta')$ with the simpler function $K(\theta) = \exp\big(-\frac{\|\theta\|^2}{2\eta^2}\big)$, i.e., $\kappa(\theta, \theta') = K(\theta - \theta')$. According to Chen et al. (2018) and Zhang et al. (2020), the stationary distribution of $\nu_t$ in the following partial differential equation equals $p(\theta|\mathbf{X})$:
$$\partial_t \nu_t = \nabla \cdot \Big(\nu_t \beta^{-1} F(\theta) + \nu_t\,\mathbb{E}_{Y \sim \nu_t}\big[K(\theta - Y) F(Y)\big] - \nu_t (\nabla K * \nu_t) + \beta^{-1}\nabla \nu_t\Big). \qquad (4)$$
When approximating $\nu_t$ in (4) with the empirical distribution formed by a set of particles $\{\theta^{(i)}\}_{i=1}^M$, Zhang et al. (2020) derive the following diffusion process characterizing the SPOS algorithm:
$$\mathrm{d}\theta_t^{(i)} = -\beta^{-1} F(\theta_t^{(i)})\,\mathrm{d}t - \frac{1}{M}\sum_{q=1}^M K(\theta_t^{(i)} - \theta_t^{(q)}) F(\theta_t^{(q)})\,\mathrm{d}t + \frac{1}{M}\sum_{q=1}^M \nabla K(\theta_t^{(i)} - \theta_t^{(q)})\,\mathrm{d}t + \sqrt{2\beta^{-1}}\,\mathrm{d}\mathcal{W}_t^{(i)}. \qquad (5)$$
Note that if the initial distributions of all the particles $\theta_0^{(i)}$ are identical, the system of these $M$ particles is exchangeable. As a result, the distributions of all the $\theta_t^{(i)}$ are identical and can be denoted by $\rho_t$. Solving the above diffusion process with a numerical method and adopting stochastic gradients $G_k^{(i)}$, one arrives at the SPOS algorithm of Zhang et al. (2020), with the update equation
$$\theta_{k+1}^{(i)} = \theta_k^{(i)} - h\beta^{-1} G_k^{(i)} - \frac{h}{M}\sum_{q=1}^M K(\theta_k^{(i)} - \theta_k^{(q)})\,G_k^{(q)} + \frac{h}{M}\sum_{q=1}^M \nabla K(\theta_k^{(i)} - \theta_k^{(q)}) + \sqrt{2\beta^{-1} h}\,\xi_k^{(i)}, \qquad (6)$$
where $\xi_k^{(i)} \sim \mathcal{N}(0, \mathbf{I})$. SPOS applies the update (6) repeatedly for all the particles $\theta_k^{(i)}$.
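As an illustration, here is a minimal NumPy sketch (ours, not the authors' code) of one SPOS update (6) applied to all particles, assuming the stochastic gradients $G_k^{(i)}$ have already been computed; `rbf_K` evaluates $K$ and $\nabla K$ for the RBF kernel with bandwidth `eta`.

```python
import numpy as np

def rbf_K(delta, eta):
    """RBF kernel K(delta) = exp(-||delta||^2 / (2 eta^2)) and its gradient."""
    K = np.exp(-np.dot(delta, delta) / (2 * eta**2))
    return K, -(delta / eta**2) * K  # grad K(delta) = -(delta / eta^2) K(delta)

def spos_step(thetas, G, h, beta, eta, rng):
    """One SPOS update, Eq. (6), for all M particles.

    thetas: (M, d) array of current particles theta_k^{(i)}
    G:      (M, d) array of stochastic gradients G_k^{(i)}
    """
    M, d = thetas.shape
    new_thetas = np.empty_like(thetas)
    for i in range(M):
        drift = -(h / beta) * G[i]                  # Langevin-type gradient term
        for q in range(M):
            K, gradK = rbf_K(thetas[i] - thetas[q], eta)
            drift -= (h / M) * K * G[q]             # kernel-weighted gradient term
            drift += (h / M) * gradK                # kernel-gradient interaction term
        noise = np.sqrt(2 * h / beta) * rng.standard_normal(d)
        new_thetas[i] = thetas[i] + drift + noise
    return new_thetas
```

Standard SPOS feeds the plain minibatch estimator into `G`; the variance-reduced variants introduced next only change how `G` is formed, leaving this update rule untouched.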
Detailed theoretical results for SPOS are reviewed in the Supplementary Material (SM).

3. Variance Reduction in SPOS

In standard SPOS, each particle is updated using $G_k^{(i)} = \frac{N}{B}\sum_{j \in I_k} F_j(\theta_k^{(i)})$. Because one can access only $B \ll N$ data points in each step, the increased variance of the noisy gradient $G_k^{(i)}$ causes a slower convergence rate. A simple way to alleviate this is to increase $B$ by using larger minibatches. Unfortunately, this brings more computational cost, an undesired side effect. Thus more effective variance-reduction methods are needed for SPOS. Inspired by recent work on variance reduction in SGLD, e.g., (Dubey et al., 2016; Chatterji et al., 2018; Zou et al., 2018), we develop three variance-reduced algorithms for SPOS based on SAGA (Defazio et al., 2014) and SVRG (Johnson & Zhang, 2013) from stochastic optimization.

3.1. SAGA-POS

SAGA-POS generalizes the idea of SAGA (Defazio et al., 2014) to an interacting particle-optimization system. For each particle $\theta_k^{(i)}$, we maintain $\{g_{k,j}^{(i)}\}_{j=1}^N$ as approximations of the individual gradients $F_j(\theta_k^{(i)})$. An unbiased estimate of the full gradient $F(\theta_k^{(i)})$ is calculated as
$$G_k^{(i)} = \sum_{j=1}^N g_{k,j}^{(i)} + \frac{N}{B}\sum_{j \in I_k}\big(F_j(\theta_k^{(i)}) - g_{k,j}^{(i)}\big), \quad \forall i, \qquad (7)$$
where $I_k$ represents the set of data indices in minibatch $k$. In each iteration, $\{g_{k,j}^{(i)}\}_{j=1}^N$ is partially updated under the rule $g_{k+1,j}^{(i)} = F_j(\theta_k^{(i)})$ if $j \in I_k$, and $g_{k+1,j}^{(i)} = g_{k,j}^{(i)}$ otherwise. The method is described in Algorithm 1; a code sketch of the estimator (7) is given after Remark 1 below.

Algorithm 1: SAGA-POS
Input: a set of initial particles $\{\theta_0^{(i)}\}_{i=1}^M$ with each $\theta_0^{(i)} \in \mathbb{R}^d$; step size $h_k$; batch size $B$. Initialize $\{g_{0,j}^{(i)}\}_{j=1}^N = \{F_j(\theta_0^{(i)})\}_{j=1}^N$ for all $i \in \{1, \dots, M\}$.
1: for iteration $k = 0, 1, \dots, T$ do
2:   Uniformly sample $I_k$ from $\{1, 2, \dots, N\}$ with replacement such that $|I_k| = B$;
3:   Sample $\xi_k^{(i)} \sim \mathcal{N}(0, \mathbf{I}_{d \times d})$, $\forall i$;
4:   Update $G_k^{(i)}$ with Eq. (7), $\forall i$;
5:   Update each $\theta_k^{(i)}$ with Eq. (6);
6:   Update $\{g_{k+1,j}^{(i)}\}_{j=1}^N$, $\forall i$: if $j \in I_k$, set $g_{k+1,j}^{(i)} = F_j(\theta_k^{(i)})$; else, set $g_{k+1,j}^{(i)} = g_{k,j}^{(i)}$;
7: end for
Output: $\{\theta_T^{(i)}\}_{i=1}^M$

Compared with standard SPOS, SAGA-POS enjoys high computational efficiency, as it does not require calculating every $F_j(\theta_k^{(i)})$ to obtain the full gradient $F(\theta_k^{(i)})$ in each iteration. Hence, the computation time of SAGA-POS is almost the same as that of POS. However, our analysis in Section 4 shows that SAGA-POS enjoys a better convergence rate. On the other hand, SAGA-POS inherits the drawback of SAGA-based algorithms: it requires memory scaling as $O(MNd)$ in the worst case, since for each particle $\theta_k^{(i)}$ one needs to store $N$ gradient approximations $\{g_{k,j}^{(i)}\}_{j=1}^N$. Fortunately, as pointed out by Dubey et al. (2016) and Chatterji et al. (2018), in some applications the memory cost scales only as $O(N)$ for SAGA-LD, which corresponds to $O(MN)$ for SAGA-POS since $M$ particles are used.

Remark 1. Compared with SAGA-LD, note that $M$ particles are used in both SPOS and SAGA-POS. This makes the memory complexity $M$ times that of SAGA-LD during training, so SAGA-POS might not seem to bring any advantage over SAGA-LD. However, this intuition is misleading. As indicated by our theoretical results in Section 4, when the number of particles $M$ is large enough, the convergence rates of our algorithms are actually better than those of their variance-reduced SGLD counterparts.
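The following is a minimal sketch (ours) of the SAGA-POS gradient estimator (7) together with the partial table refresh of line 6 of Algorithm 1, for a single particle; `F_j` is a hypothetical callable returning the individual gradient $F_j(\theta)$.

```python
import numpy as np

def saga_gradient(theta_i, g_table_i, I_k, F_j):
    """SAGA-POS estimate, Eq. (7), for one particle; refreshes the table in place.

    theta_i:   (d,) position of particle i
    g_table_i: (N, d) stored approximations g_{k,j}^{(i)} of the F_j(theta^{(i)})
    I_k:       minibatch indices, sampled with replacement
    F_j:       callable F_j(j, theta) -> (d,) individual gradient
    """
    N = g_table_i.shape[0]
    B = len(I_k)
    G = g_table_i.sum(axis=0)                  # sum_j g_{k,j}^{(i)}
    for j in I_k:                              # minibatch correction term
        G += (N / B) * (F_j(j, theta_i) - g_table_i[j])
    for j in set(I_k):                         # partial refresh: only sampled entries
        g_table_i[j] = F_j(j, theta_i)
    return G
```

Only the $B$ sampled table entries change per iteration, which is why SAGA-POS matches the per-iteration cost of SPOS while storing $O(MNd)$ numbers in the worst case.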
3.2. SVRG-POS

For settings with limited memory, we propose SVRG-POS, which is based on the SVRG method of Johnson & Zhang (2013). For each particle $\theta_k^{(i)}$, one stores a stale parameter $\tilde{\theta}^{(i)}$ that is updated only occasionally, every $\tau$ iterations. At each such update, we further conduct a global evaluation of the full gradient at $\tilde{\theta}^{(i)}$, i.e., $\tilde{G}^{(i)} \leftarrow F(\tilde{\theta}^{(i)})$. An unbiased gradient estimate is then calculated by leveraging both $\tilde{G}^{(i)}$ and $\tilde{\theta}^{(i)}$ as
$$G_k^{(i)} = \tilde{G}^{(i)} + \frac{N}{B}\sum_{j \in I_k}\big[F_j(\theta_k^{(i)}) - F_j(\tilde{\theta}^{(i)})\big]. \qquad (8)$$
The method is depicted in Algorithm 2, where one only needs to store $\tilde{\theta}^{(i)}$ and $\tilde{G}^{(i)}$ instead of gradient estimates of all the individual $F_j$. Hence the memory cost scales as $O(Md)$, almost the same as that of standard SPOS. Although SVRG-POS alleviates the storage requirement of SAGA-POS significantly, it has the downside that the full gradients $F(\tilde{\theta}^{(i)}) = \sum_{j=1}^N F_j(\tilde{\theta}^{(i)})$ must be recomputed every $\tau$ iterations, leading to high computational cost in a big-data scenario.

Algorithm 2: SVRG-POS
Input: a set of initial particles $\{\theta_0^{(i)}\}_{i=1}^M$ with each $\theta_0^{(i)} \in \mathbb{R}^d$; step size $h$; epoch length $\tau$; batch size $B$. Initialize $\tilde{\theta}^{(i)} \leftarrow \theta_0^{(i)}$ and $\tilde{G}^{(i)} \leftarrow F(\theta_0^{(i)})$, $\forall i$.
1: for iteration $k = 0, 1, \dots, T$ do
2:   if $k \bmod \tau = 0$ then
3:     Option I: i) sample $\ell \sim \mathrm{unif}\{0, 1, \dots, \tau - 1\}$; ii) update $\tilde{\theta}^{(i)} \leftarrow \theta_{k-\ell}^{(i)}$ and $\theta_k^{(i)} \leftarrow \tilde{\theta}^{(i)}$, $\forall i$; iii) update $\tilde{G}^{(i)} \leftarrow F(\tilde{\theta}^{(i)})$, $\forall i$;
4:     Option II: i) update $\tilde{\theta}^{(i)} \leftarrow \theta_k^{(i)}$; ii) update $\tilde{G}^{(i)} \leftarrow F(\theta_k^{(i)})$, $\forall i$;
5:   end if
6:   Uniformly sample $I_k$ from $\{1, 2, \dots, N\}$ with replacement such that $|I_k| = B$;
7:   Sample $\xi_k^{(i)} \sim \mathcal{N}(0, \mathbf{I}_{d \times d})$, $\forall i$;
8:   Update $G_k^{(i)} \leftarrow \tilde{G}^{(i)} + \frac{N}{B}\sum_{j \in I_k}\big[F_j(\theta_k^{(i)}) - F_j(\tilde{\theta}^{(i)})\big]$, $\forall i$;
9:   Update each $\theta_k^{(i)}$ with Eq. (6);
10: end for
Output: $\{\theta_T^{(i)}\}_{i=1}^M$

Remark 2. i) Similar to SAGA-POS, according to our theory in Section 4, SVRG-POS enjoys a faster convergence rate than SVRG-LD, its SGLD counterpart, although $M$ times more space is required for the particles. This provides a trade-off between convergence rate and space complexity. ii) Previous work has shown that SAGA typically outperforms SVRG in terms of convergence speed (Dubey et al., 2016; Chatterji et al., 2018). This conclusion applies to our case as well, as verified both by the theoretical analysis in Section 4 and by the experiments in Section 5.

3.3. SVRG-POS+

The need for full-gradient computation in SVRG-POS motivates the development of SVRG-POS+. This algorithm is inspired by the recent work on SVRG-LD+, which reduces the computational cost of SVRG-LD (Zou et al., 2018). The main idea of SVRG-POS+ is to replace the full-gradient computation every $\tau$ iterations with a subsampled gradient: one uniformly samples $|J_k| = b$ data points, where $J_k$ consists of random samples from $\{1, 2, \dots, N\}$ drawn with replacement. Given the subsampled data, $\tilde{\theta}^{(i)}$ and $\tilde{G}^{(i)}$ are updated as
$$\tilde{\theta}^{(i)} = \theta_k^{(i)}, \qquad \tilde{G}^{(i)} = \frac{N}{b}\sum_{j \in J_k} F_j(\theta_k^{(i)}).$$
The full method is shown in Algorithm 3; a code sketch of both snapshot schemes follows the algorithm.

Algorithm 3: SVRG-POS+
Input: a set of initial particles $\{\theta_0^{(i)}\}_{i=1}^M$ with each $\theta_0^{(i)} \in \mathbb{R}^d$; step size $h$; epoch length $\tau$; batch sizes $B$ and $b$. Initialize $\tilde{\theta}^{(i)} \leftarrow \theta_0^{(i)}$ and $\tilde{G}^{(i)} \leftarrow F(\theta_0^{(i)})$, $\forall i$.
1: for iteration $k = 0, 1, \dots, T$ do
2:   if $k \bmod \tau = 0$ then
3:     i) Uniformly sample $J_k$ from $\{1, 2, \dots, N\}$ with replacement such that $|J_k| = b$; ii) update $\tilde{\theta}^{(i)} \leftarrow \theta_k^{(i)}$ and $\tilde{G}^{(i)} \leftarrow \frac{N}{b}\sum_{j \in J_k} F_j(\theta_k^{(i)})$, $\forall i$;
4:   end if
5:   Uniformly sample $I_k$ from $\{1, 2, \dots, N\}$ with replacement such that $|I_k| = B$;
6:   Sample $\xi_k^{(i)} \sim \mathcal{N}(0, \mathbf{I}_{d \times d})$, $\forall i$;
7:   Update $G_k^{(i)} \leftarrow \tilde{G}^{(i)} + \frac{N}{B}\sum_{j \in I_k}\big[F_j(\theta_k^{(i)}) - F_j(\tilde{\theta}^{(i)})\big]$, $\forall i$;
8:   Update each $\theta_k^{(i)}$ with Eq. (6);
9: end for
Output: $\{\theta_T^{(i)}\}_{i=1}^M$
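A minimal sketch (ours) of the SVRG-POS estimator (8) and of the two snapshot refreshes: the full-gradient refresh used by SVRG-POS and the subsampled refresh used by SVRG-POS+; `F_j` is again a hypothetical callable for the individual gradients.

```python
def svrg_gradient(theta_i, theta_snap_i, G_snap_i, I_k, F_j, N):
    """SVRG estimate, Eq. (8): snapshot gradient plus a minibatch correction."""
    B = len(I_k)
    corr = sum(F_j(j, theta_i) - F_j(j, theta_snap_i) for j in I_k)
    return G_snap_i + (N / B) * corr

def refresh_snapshot(theta_i, F_j, N):
    """SVRG-POS refresh (Option II style): full gradient at the snapshot."""
    G_snap = sum(F_j(j, theta_i) for j in range(N))   # O(N) cost every tau steps
    return theta_i.copy(), G_snap

def refresh_snapshot_plus(theta_i, J_k, F_j, N):
    """SVRG-POS+ refresh: subsampled gradient on |J_k| = b points, avoiding the O(N) pass."""
    b = len(J_k)
    G_snap = (N / b) * sum(F_j(j, theta_i) for j in J_k)
    return theta_i.copy(), G_snap
```

The design difference between the two refreshes is exactly the compute/variance trade-off analyzed below: SVRG-POS+ trades the exact anchor $F(\tilde{\theta}^{(i)})$ for a noisy one that is $N/b$ times cheaper to form.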
4. Convergence Analysis

We prove non-asymptotic convergence rates for the SAGA-POS, SVRG-POS and SVRG-POS+ algorithms under the 2-Wasserstein metric, defined as
$$W_2(\mu, \nu) = \Big(\inf_{\zeta \in \Gamma(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|X_\mu - X_\nu\|^2\,\mathrm{d}\zeta(X_\mu, X_\nu)\Big)^{1/2},$$
where $\Gamma(\mu, \nu)$ is the set of all couplings of $\mu$ and $\nu$ on $\mathbb{R}^d \times \mathbb{R}^d$, i.e., joint distributions with marginals matching $\mu$ and $\nu$. Let $\mu^*$ denote our target distribution, and $\mu_T$ the (common) distribution of the particles $\theta_T^{(i)}$ after $T$ iterations. Our analysis aims to bound $W_2(\mu_T, \mu^*)$. We first introduce our assumptions.

Assumption 1. $F$ and $K$ satisfy the following conditions:
- There exists a positive constant $m_F$ such that $\langle F(\theta) - F(\theta'), \theta - \theta'\rangle \ge m_F\|\theta - \theta'\|^2$;
- $F$ is $L_F$-Lipschitz continuous, i.e., $\|F(\theta) - F(\theta')\| \le L_F\|\theta - \theta'\|$;
- $K$ is $L_K$-Lipschitz continuous for some $L_K \ge 0$, i.e., $\|K(\theta) - K(\theta')\| \le L_K\|\theta - \theta'\|$, and $\nabla K$ is $L_{\nabla K}$-Lipschitz continuous for some $L_{\nabla K} \ge 0$, i.e., $\|\nabla K(\theta) - \nabla K(\theta')\| \le L_{\nabla K}\|\theta - \theta'\|$;
- $K$ is an even function, i.e., $K(-\theta) = K(\theta)$;
- The initial probability law of each particle has a bounded and strictly positive density $\rho_0$ with respect to the Lebesgue measure on $\mathbb{R}^d$, and $\gamma_0 \triangleq \log \int_{\mathbb{R}^d} e^{\|\theta\|^2}\rho_0(\theta)\,\mathrm{d}\theta < \infty$.

Assumption 2. There exists a constant $D_F > 0$ such that $\|\nabla F(\theta) - \nabla F(\theta')\| \le D_F\|\theta - \theta'\|$.

Assumption 3. There exists a constant $\sigma$ such that for all $j \in \{1, 2, \dots, N\}$, $\mathbb{E}\big[\|F_j(\theta) - \frac{1}{N}F(\theta)\|^2\big] \le d\sigma^2/N^2$.

Assumption 4. There exist positive constants $H_1, H_2$ such that
$$\|F(\theta_2)K(\theta_1' - \theta_1) - F(\theta_2)K(\theta_2' - \theta_2)\| \le H_1 L_K\,\|(\theta_1' - \theta_1) - (\theta_2' - \theta_2)\|$$
and
$$\|F(\theta_2)\nabla K(\theta_1' - \theta_1) - F(\theta_2)\nabla K(\theta_2' - \theta_2)\| \le H_2 L_{\nabla K}\,\|(\theta_1' - \theta_1) - (\theta_2' - \theta_2)\|.$$

Remark 3. i) Assumption 1 is adopted from (Zhang et al., 2020). Its first bullet implies that $U(\theta)$ is strongly convex, which is the standard assumption in analyzing SGLD (Dalalyan & Karagulyan, 2017; Durmus & Moulines, 2016) and its variance-reduced variants (Zou et al., 2018; Chatterji et al., 2018). Although some work has investigated the non-convex case, the convex case is more common, and more instructive and meaningful for addressing practical issues (Dalalyan & Karagulyan, 2017; Durmus & Moulines, 2016; Zou et al., 2018; Chatterji et al., 2018). ii) All of $m_F$, $L_F$ and $D_F$ can scale linearly with $N$. iii) $K(\theta) = \exp\big(-\frac{\|\theta\|^2}{2\eta^2}\big)$ can satisfy the above assumptions when the bandwidth $\eta$ is set large enough; $K$ is also Hessian-Lipschitz with some positive constant $D_{\nabla^2 K}$, and $\|\nabla K\|$ can be bounded by some positive constant $H_{\nabla K}$. iv) Assumption 4 can be viewed as an extension of the Lipschitz continuity in Assumption 1, and it is used to bridge the work of (Chatterji et al., 2018) and (Zhang et al., 2020). We allow $H_1, H_2$ to be related to $F$, and they can scale linearly with $N$.
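To make Remark 3 iii) concrete, a short calculation (ours, not from the paper) gives explicit constants for the RBF kernel; all of them shrink as the bandwidth $\eta$ grows, so the assumptions can indeed be met by taking $\eta$ large enough:

```latex
% Explicit constants for K(theta) = exp(-||theta||^2 / (2 eta^2)).
\begin{align*}
\nabla K(\theta) &= -\frac{\theta}{\eta^2}\,K(\theta), &
\|\nabla K(\theta)\| &= \frac{\|\theta\|}{\eta^2}\,e^{-\|\theta\|^2/(2\eta^2)}
  \;\le\; \frac{1}{\eta\sqrt{e}}, \\[2pt]
\nabla^2 K(\theta) &= \Big(\frac{\theta\theta^\top}{\eta^4} - \frac{I}{\eta^2}\Big)K(\theta), &
\big\|\nabla^2 K(\theta)\big\|_{\mathrm{op}} &\;\le\; \frac{1}{\eta^2},
\end{align*}
% so one may take H_{\nabla K} = L_K = 1/(\eta\sqrt{e}) and L_{\nabla K} = 1/\eta^2,
% where the first bound is attained at ||theta|| = eta and the second at theta = 0.
```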
Now we present the convergence analysis of our algorithms, where $(C_1, \dots, C_5)$ are positive constants independent of $T$, whose exact forms are presented in the appendix.

Theorem 1. Let $\mu_T$ denote the distribution of the particles after $T$ iterations of SAGA-POS, and consider step size $h < \frac{B}{8 C_2 N}$ and batch size $B \ge 9$. Under Assumptions 1 and 2, the convergence rate of SAGA-POS is bounded as
$$W_2(\mu_T, \mu^*) \le \frac{C_1}{\sqrt{M}} + 5\exp\Big(-\frac{C_3 h}{4}T\Big)W_2(\mu_0, \mu^*) + \frac{2h\big(C_4 d\,M^{1/2} + d\big)}{C_3 M} + \frac{24 h\,C_2 N\sqrt{d}}{M\sqrt{C_3}\,B}. \qquad (9)$$

Theorem 2. Let $\mu_T$ denote the distribution of the particles after $T$ iterations of SVRG-POS in Algorithm 2. Under Assumptions 1 and 2, if we choose Option I and set the step size $h < \frac{1}{8 C_2}$, the batch size $B \ge 2$ and the epoch length $\tau = \frac{4}{h C_3 (1 - 2h C_2(1 + 2/B))}$, then for all $T$ such that $T \bmod \tau = 0$, the convergence rate of SVRG-POS is bounded as
$$W_2(\mu_T, \mu^*) \le \frac{C_1}{\sqrt{M}} + 5\exp\Big(-\frac{C_3 h}{4}T\Big)W_2(\mu_0, \mu^*) + \frac{2h\big(C_4 d\,M^{1/2} + d\big)}{C_3 M} + \frac{64\,C_2}{M}\sqrt{\frac{h\,d\,N}{C_3\,B}}. \qquad (10)$$
If we choose Option II and set the step size $h < \frac{B}{4\tau C_2}$, the convergence rate of SVRG-POS is bounded for all $T$ as
$$W_2(\mu_T, \mu^*) \le \frac{C_1}{\sqrt{M}} + 5\exp\Big(-\frac{C_3 h}{4}T\Big)W_2(\mu_0, \mu^*) + \frac{2h\big(C_4 d\,M^{1/2} + d\big)}{C_3 M} + \frac{9 h\,C_2 N\sqrt{d}}{M\sqrt{C_3 B}}. \qquad (11)$$

Theorem 3. Let $\mu_T$ denote the distribution of the particles after $T$ iterations of SVRG-POS+. Under Assumptions 1, 2 and 3, if we set the step size $h \le \min\Big\{\big(\frac{B C_3}{24 C_2^4 \tau^2}\big)^{1/3}, \frac{1}{6(C_5^2/b + C_2)}\Big\}$, the convergence rate of SVRG-POS+ is bounded for all $T$ as
$$W_2(\mu_T, \mu^*) \le \frac{C_1}{\sqrt{M}} + \big(1 - h C_3/4\big)^T W_2(\mu_0, \mu^*) + \frac{\sigma\sqrt{d}\;\mathbf{1}(b < N)}{M\sqrt{C_3}\,b^{1/2}} + \frac{2h\,C_4 d\,M^{1/2}}{C_3 M} + \frac{4h C_2(\tau d)^{1/2} + 3h^{1/2} d^{1/2} C_5}{M\sqrt{C_3}}. \qquad (12)$$

Since complexity has been discussed in Section 3, we mainly focus on the convergence rates here. Due to space limits, the comparison between the convergence rates of standard SPOS and its variance-reduced counterparts (such as SAGA-POS) is deferred to the SM. Specifically, adopting the standard framework for comparing variance-reduction techniques in SGLD (Dubey et al., 2016; Chatterji et al., 2018; Zou et al., 2018), we focus on the scenario where $m_F$, $L_F$, $H_F$ and $D_F$ all scale linearly with $N$, and $N \gg d$. In this case, the last term in Theorem 1 dominates for SAGA-POS: $O\big(\frac{h C_2 \sqrt{d}\,N}{M B}\big) \approx O\big(\frac{h N\sqrt{d}}{M B}\big)$. Thus, to achieve an accuracy of $\varepsilon$, we need a step size $h_{ag} = O\big(\frac{\varepsilon M B}{N\sqrt{d}}\big)$. For SVRG-POS, the dominant term in Theorem 2 is $O\big(\frac{1}{M}\sqrt{\frac{h d N}{B}}\big)$ for Option I and $O\big(\frac{h N\sqrt{d}}{M\sqrt{B}}\big)$ for Option II. Hence, for an accuracy of $\varepsilon$, the corresponding step sizes are $h_{vr1} = O\big(\frac{\varepsilon^2 M^2 B}{N d}\big)$ and $h_{vr2} = O\big(\frac{\varepsilon M\sqrt{B}}{N\sqrt{d}}\big)$, respectively. Because the mixing time $T$ of these methods is roughly proportional to the reciprocal of the step size (Chatterji et al., 2018), when $\varepsilon$ is small enough one has $h_{vr1} \ll h_{ag}$, which makes SAGA-POS converge faster than SVRG-POS (Option I). Similar results hold for Option II, since the factor $\frac{1}{\sqrt{B}}$ in $h_{vr2}$ makes that step size even smaller. More theoretical results are given in the SM.

Remark 4. We have provided a theoretical analysis to support statement i) of Remark 2. Moreover, note that in SAGA-POS the step size $h_{ag} = O\big(\frac{\varepsilon M B}{N\sqrt{d}}\big)$ carries an extra factor of $M$ compared with the step size $O\big(\frac{\varepsilon B}{N\sqrt{d}}\big)$ used in SAGA-LD (Chatterji et al., 2018)¹. This means that SAGA-POS with more particles (large $M$) can outperform SAGA-LD; SVRG-POS and SVRG-POS+ yield similar conclusions. This provides theoretical support for the statements of Remark 1 and of i) in Remark 2. Furthermore, an interesting consequence of the above discussion is that $h_{vr1} = O\big(\frac{\varepsilon^2 M^2 B}{N d}\big)$ in SVRG-POS carries an extra factor of $M^2$ compared to the step size $O\big(\frac{\varepsilon^2 B}{N d}\big)$ in SVRG-LD (Chatterji et al., 2018). Since $M^2$ is of higher order than $M$, one expects the improvement of SVRG-POS over SVRG-LD to be more significant than that of SAGA-POS over SAGA-LD. This conclusion is verified in our experiments.

¹ For fair comparisons with our algorithms, we consider variance-reduced versions of SGLD with $M$ independent chains.

5. Experiments

We conduct experiments to verify our theory, comparing SAGA-POS, SVRG-POS and SVRG-POS+ with existing representative Bayesian sampling methods with and without variance reduction: SGLD and SPOS without variance reduction; SAGA-LD, SVRG-LD and SVRG-LD+ with variance reduction. For SVRG-POS, we focus on Option I of Algorithm 2 to verify our theory.

5.1. Synthetic log-normal distribution

We first evaluate the proposed algorithms on synthetic data drawn from a log-normal distribution, $p(x|\mu) = \frac{1}{x\sqrt{2\pi}}\exp\big(-\frac{(\ln x - \mu)^2}{2}\big)$, where $x, \mu \in \mathbb{R}^{10}$ (applied elementwise). Like other work on variance-reduced algorithms (Chatterji et al., 2018), we calculate the log-MSE of the sampled mean with respect to the true value, and plot the log-MSE versus the number of passes through the data.
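For reference, here is a minimal sketch (ours) of this evaluation protocol: draw log-normal data with a known $\mu$, then score a set of posterior samples by the log-MSE of their mean. The seed and sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 10, 1000
mu_true = rng.standard_normal(d)                    # ground-truth mean parameter
X = np.exp(mu_true + rng.standard_normal((N, d)))   # ln x | mu ~ N(mu, I), so x is log-normal

def log_mse(samples, mu_true):
    """log-MSE of the estimated mean, the quantity plotted in Figure 1."""
    estimate = samples.mean(axis=0)                 # sampled mean over particles/iterates
    return np.log(np.mean((estimate - mu_true) ** 2))
```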
The results are plotted in Figure 1, which shows that SAGA-POS and SVRG-POS converge the fastest among all algorithms. It is also interesting to see that SPOS outperforms even SAGA-LD and SVRG-LD.

Figure 1. Log-MSE of the mean parameter versus the number of dataset passes.

5.2. Bayesian logistic regression

Following related work (Dubey et al., 2016), we test the proposed algorithms on Bayesian logistic regression (BLR) with four publicly available datasets from the UCI machine learning repository: Australian (690-14), Pima (768-8), Diabetic (1151-20) and Susy (100000-18), where (N-d) denotes a dataset of N data points with dimensionality d. The first three datasets are relatively small; the last one is large and suitable for evaluating scalable Bayesian sampling algorithms. Consider a dataset $\{x_i, y_i\}_{i=1}^N$ with $N$ samples, where $x_i \in \mathbb{R}^d$ and $y_i \in \{0, 1\}$. The likelihood of a BLR model is $p(y_i = 1|x_i, \theta) = \mathrm{sigmoid}(\theta^\top x_i)$, with regression coefficient $\theta \in \mathbb{R}^d$, which for simplicity is assumed to be drawn from a standard multivariate Gaussian prior $\mathcal{N}(0, \mathbf{I})$. The datasets are split into 80% training data and 20% testing data. Optimized constant step sizes are applied for each algorithm via grid search. Following existing work, we report testing accuracy and log-likelihood versus the number of data passes for each dataset, averaging over 10 runs with 50 particles. The minibatch size is set to 15 for all experiments.
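For this model the individual gradients $F_j$ have a simple closed form; below is a minimal sketch (ours) that could be plugged into any of the estimators above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def blr_F_j(theta, x_j, y_j, N):
    """Individual gradient for Bayesian logistic regression with a N(0, I) prior:
    F_j(theta) = -grad log p(y_j | x_j, theta) - (1/N) grad log p(theta).
    """
    grad_loglik = (y_j - sigmoid(theta @ x_j)) * x_j  # grad of Bernoulli log-likelihood
    grad_logprior = -theta                            # grad log N(0, I) density = -theta
    return -grad_loglik - grad_logprior / N
```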
5.2.1. Variance-reduced SPOS versus SPOS

We first compare SAGA-POS, SVRG-POS and SVRG-POS+ with the SPOS algorithm of (Zhang et al., 2020), i.e., without variance reduction. The testing accuracies and log-likelihoods versus the number of passes through the data on the four datasets are plotted in Figure 2. SAGA-POS converges faster than both SVRG-POS and SVRG-POS+, and all three significantly outperform SPOS. On the largest dataset, Susy, SAGA-POS starts only after one pass through the data (used to initialize the gradient table), but then converges quickly, outperforming the other algorithms. SVRG-POS+ outperforms SVRG-POS because Susy is so large that the periodic full-gradient computations of SVRG-POS are costly, whereas SVRG-POS+ requires only minibatch computations. All of these phenomena are supported by our theory.

Figure 2. Testing accuracy and log-likelihood versus the number of data passes for SPOS and its variance-reduced variants (panels include Australian and Diabetic).

5.2.2. Variance-reduced SPOS versus variance-reduced SGLD

Next we compare the three variance-reduced SPOS algorithms with their SGLD counterparts, i.e., SAGA-LD, SVRG-LD and SVRG-LD+. The results are plotted in Figure 3. Similar phenomena are observed: both SAGA-POS and SVRG-POS outperform SAGA-LD and SVRG-LD, respectively, consistent with the theoretical results discussed in Remarks 1 and 2. Interestingly, on the Pima dataset, SVRG-LD performs even worse (converges more slowly) than standard SGLD. Furthermore, as discussed in Remark 4, our theory indicates that the improvement of SVRG-POS over SVRG-LD is more significant than that of SAGA-POS over SAGA-LD; this is indeed verified by inspecting the plots in Figure 3.

Figure 3. Testing accuracy and log-likelihood versus the number of dataset passes for variance-reduced SPOS and SGLD (panels include Australian, Diabetic and Susy).

5.2.3. Impact of the number of particles

Finally, we examine the impact of the number of particles on the convergence rates. As indicated by Theorems 1-3, for a fixed number of iterations $T$, the convergence error in terms of the 2-Wasserstein distance decreases as the number of particles increases. To verify this, we run SAGA-POS and SVRG-POS for BLR with the number of particles ranging over {1, 2, 4, 8, 16}. The testing log-likelihoods versus the number of iterations are plotted in Figure 4. The results are indeed consistent with our theory.

Figure 4. Testing log-likelihood versus the number of iterations with different numbers of particles for variance-reduced SPOS (panels include Australian and Pima).

6. Additional Theoretical Discussion for SAGA-POS, SVRG-POS and SVRG-POS+

We now discuss the mixing time and gradient complexity of our algorithms. The mixing time is the number of iterations needed to provably achieve error less than $\varepsilon$, measured in $W_2$ distance (Chatterji et al., 2018). The gradient complexity (Zou et al., 2018), which is almost the same as the computational complexity of (Chatterji et al., 2018), is defined as the number of stochastic gradient evaluations required to achieve a target accuracy $\varepsilon$. Table 1 presents the mixing time and gradient complexity of several related algorithms; we focus on Option I of SVRG-POS. Our results for SVRG-LD+ and SVRG-POS+ may differ slightly from those reported in (Zou et al., 2018), since we adopt a different definition of $F_j$. Note that the result for SVRG-POS+ is derived by adopting $B = 1$ and $b = O(d\sigma^2/(m_F^2\varepsilon^2))$ from (Zou et al., 2018), which also sheds light on the optimal choice of $b$ and $B$ in our SVRG-POS+. For fair comparisons with our algorithms, we consider variance-reduced versions of SGLD with $M$ independent chains, which can be shown to have the same convergence rate in terms of the total number of updates (Chen et al., 2015). Hence, the gradient complexities of SAGA-LD, SVRG-LD and SVRG-LD+ need to be scaled by $M$, consistent with the discussion in Section 3 and our experimental results.

Table 1. Mixing time and gradient complexity.

| Algorithm | Mixing time | Gradient complexity |
| --- | --- | --- |
| SAGA-LD | $O\big(\frac{(L_F/m_F)^{3/2}\sqrt{d}}{B\varepsilon}\big)$ | $O\big(N + \frac{(L_F/m_F)^{3/2}\sqrt{d}}{\varepsilon}\big)$ |
| SAGA-POS | $O\big(\frac{(C_2/C_3)^{3/2}\sqrt{d}}{B M\varepsilon}\big)$ | $O\big(NM + \frac{(C_2/C_3)^{3/2}\sqrt{d}}{\varepsilon}\big)$ |
| SVRG-LD | $O\big(\frac{(L_F/m_F)^{3} d}{B\varepsilon^2}\big)$ | $O\big(N + \frac{(L_F/m_F)^{3} d}{\varepsilon^2}\big)$ |
| SVRG-POS | $O\big(\frac{(C_2/C_3)^{3} d}{B M^2\varepsilon^2}\big)$ | $O\big(NM + \frac{(C_2/C_3)^{3} d}{M\varepsilon^2}\big)$ |
| SVRG-LD+ | $O\big(\frac{\sigma^2 d}{m_F^2\varepsilon^2}\big)$ | $O\big(\frac{\sigma^2 d}{m_F^2\varepsilon^2}\big(N + \frac{(L_F/m_F)^{3/2}\sqrt{d}}{\varepsilon}\big)\big)$ |
| SVRG-POS+ | $O\big(\frac{C_4^2 d}{M^2 C_3^2\varepsilon^2}\big)$ | $O\big(\frac{C_4^2 d}{C_3^2\varepsilon^2}\big(NM + \frac{(C_2/C_3)^{3/2}\sqrt{d}}{\varepsilon}\big)\big)$ |

Since the convergence guarantees in Theorems 1, 2 and 3 are developed with respect to both the iteration number $T$ and the particle number $M$, we also define the threshold-particle, the number of particles needed to provably achieve error less than $\varepsilon$ measured in $W_2$ distance; threshold-particles for our algorithms are given in Table 2. Note that the $M$ appearing in the mixing times of Table 1 should also satisfy $M \ge C_1^2/\varepsilon^2$. In practice, since $C_1$ grows with $\beta$ through the constants $H_{\nabla K}$, $H_F$, $L_K$, $L_F$ and $L_{\nabla K}$, we set $\beta$ small enough to prevent the threshold-particle from becoming too large. In our experiments, since $H_F$, $L_K$ and $L_F$ are not large, this issue turns out to be a minor concern.

Finally, we comment on the memory footprint of SAGA-POS. From Algorithm 1, one must store all the terms $G_k^{(i)}$ in each iteration; however, $\{G_k^{(i)}\}_{i=1}^M$ scales only as $O(Md)$. Taking the dataset Susy as an example, $d = 18$, and in practice we use only a moderate number of particles ($M = 50$ in our experiments), so the $G_k^{(i)}$ terms remain small. Hence we do not take $\{G_k^{(i)}\}_{i=1}^M$ into account in the memory analysis.
Table 2. Threshold-particle.

| Algorithm | Threshold-particle |
| --- | --- |
| SAGA-POS | $C_1^2/\varepsilon^2$ |
| SVRG-POS | $C_1^2/\varepsilon^2$ |
| SVRG-POS+ | $C_1^2/\varepsilon^2$ |

7. Conclusions

We propose several variance-reduction techniques for stochastic particle-optimization sampling. For the first time, we develop a non-asymptotic convergence theory for these algorithms in terms of the 2-Wasserstein metric. Our theoretical results indicate improved convergence rates for the proposed variance-reduced SPOS algorithms compared to both standard SPOS and the variance-reduced SGLD algorithms. The theory is verified by a number of experiments on both synthetic data and real data for Bayesian logistic regression. Leveraging both our theory and empirical findings, we recommend the following choices in practice: i) SAGA-POS is preferable when storage is not a concern; ii) SVRG-POS is a better choice when storage is a concern and full gradients are feasible to compute; and iii) SVRG-POS+ is a good choice that works well in practice when one faces both computation and storage limitations.

Acknowledgements

The research performed at Duke University was supported in part by DARPA, DOE, NSF and ONR.

References

Chatterji, N., Flammarion, N., Ma, Y.-A., Bartlett, P., and Jordan, M. On the theory of variance reduction for stochastic gradient Monte Carlo. In ICML, 2018.

Chen, C., Ding, N., and Carin, L. On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In NIPS, 2015.

Chen, C., Zhang, R., Wang, W., Li, B., and Chen, L. A unified particle-optimization framework for scalable Bayesian sampling. In UAI, 2018.

Chen, C., Wang, W., Zhang, Y., Su, Q., and Carin, L. A convergence analysis for a class of practical variance-reduction stochastic gradient MCMC. Science China Information Sciences, 62, 2019.

Chen, T., Fox, E. B., and Guestrin, C. Stochastic gradient Hamiltonian Monte Carlo. In ICML, 2014.

Şimşekli, U., Badeau, R., Cemgil, A. T., and Richard, G. Stochastic quasi-Newton Langevin Monte Carlo. In ICML, 2016.

Dalalyan, A. S. and Karagulyan, A. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. arXiv preprint arXiv:1710.00095, 2017.

Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, 2014.

Ding, N., Fang, Y., Babbush, R., Chen, C., Skeel, R. D., and Neven, H. Bayesian sampling using stochastic gradient thermostats. In NIPS, 2014.

Dubey, A., Reddi, S. J., Póczos, B., Smola, A. J., and Xing, E. P. Variance reduction in stochastic gradient Langevin dynamics. In NIPS, 2016.

Durmus, A. and Moulines, E. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. arXiv preprint arXiv:1605.01559, 2016.

Gan, Z., Chen, C., Henao, R., Carlson, D., and Carin, L. Scalable deep Poisson factor analysis for topic modeling. In ICML, 2015.

Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In ICML, 2018.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, 2013.

Kolmogoroff, A. Über die analytischen Methoden in der Wahrscheinlichkeitsrechnung. Mathematische Annalen, 104(1):415-458, 1931.

Li, B., Chen, C., Liu, H., and Carin, L. On connecting stochastic gradient MCMC and differential privacy. Technical report, arXiv:1712.09097, 2017. URL http://arxiv.org/abs/1712.09097.

Li, C., Chen, C., Carlson, D., and Carin, L. Preconditioned stochastic gradient Langevin dynamics for deep neural networks. In AAAI, 2016.

Liu, C., Zhu, J., and Song, Y. Stochastic gradient geodesic MCMC methods. In NIPS, 2016.

Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In NIPS, 2016.
Risken, H. The Fokker-Planck Equation. Springer-Verlag, New York, 1989.

Feizi, S., Suh, C., Xia, F., and Tse, D. Understanding GANs: the LQG setting. arXiv preprint arXiv:1710.10793, 2017.

Springenberg, J. T., Klein, A., Falkner, S., and Hutter, F. Bayesian optimization with robust Bayesian neural networks. In NIPS, 2016.

Wang, Y.-X., Fienberg, S. E., and Smola, A. Privacy for free: Posterior sampling and stochastic gradient Monte Carlo. In ICML, 2015.

Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.

Zhang, J., Zhang, R., and Chen, C. Stochastic particle-optimization sampling and the non-asymptotic convergence theory. In AISTATS, 2020.

Zhang, R., Chen, C., Li, C., and Carin, L. Policy optimization as Wasserstein gradient flows. In ICML, 2018a.

Zhang, R., Li, C., Chen, C., and Carin, L. Learning structural weight uncertainty for sequential decision-making. In AISTATS, 2018b.

Zhang, R., Wen, Z., Chen, C., and Carin, L. Scalable Thompson sampling via optimal transport. In AISTATS, 2019.

Zou, D., Xu, P., and Gu, Q. Subsampled stochastic variance-reduced gradient Langevin dynamics. In UAI, 2018.

Zou, D., Xu, P., and Gu, Q. Stochastic gradient Hamiltonian Monte Carlo methods with recursive variance reduction. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 3835-3846, 2019.