# Faster Sampling via Stochastic Gradient Proximal Sampler

Xunpeng Huang 1, Difan Zou 2, Hanze Dong 3, Yi-An Ma 4, Tong Zhang 5

Abstract: Stochastic gradients have been widely integrated into Langevin-based methods to improve their scalability and efficiency in solving large-scale sampling problems. However, the proximal sampler, which exhibits much faster convergence than Langevin-based algorithms in the deterministic setting (Lee et al., 2021), has yet to be explored in its stochastic variants. In this paper, we study the Stochastic Proximal Samplers (SPS) for sampling from non-log-concave distributions. We first establish a general framework for implementing stochastic proximal samplers and then establish the corresponding convergence theory. We show that convergence to the target distribution can be guaranteed as long as the second moment of the algorithm trajectory is bounded and restricted Gaussian oracles can be well approximated. We then provide two implementable variants based on stochastic gradient Langevin dynamics (SGLD) and the Metropolis-adjusted Langevin algorithm (MALA), giving rise to SPS-SGLD and SPS-MALA. We further show that SPS-SGLD and SPS-MALA can achieve $\epsilon$ sampling error in total variation (TV) distance within $O(d\epsilon^{-2})$ and $O(d^{1/2}\epsilon^{-2})$ gradient complexities, which outperform the best-known results by at least an $O(d^{1/3})$ factor. This enhancement in performance is corroborated by our empirical studies on synthetic data with various dimensions, demonstrating the efficiency of our proposed algorithm.

1 The Hong Kong University of Science and Technology 2 The University of Hong Kong 3 Salesforce AI Research 4 University of California, San Diego 5 University of Illinois Urbana-Champaign. Correspondence to: Xunpeng Huang, Difan Zou.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

1. Introduction

Sampling from a target distribution $p^* \propto \exp(-f)$ is a fundamental problem in many research fields such as statistics (Neal, 1993), scientific computing (Robert et al., 1999), and machine learning (Bishop & Nasrabadi, 2006). Here, $f: \mathbb{R}^d \to \mathbb{R}$ is referred to as the negative log-density or energy function of $p^*$. To solve this problem, Langevin-based sampling algorithms, which discretize the continuous-time Langevin dynamics, are the most popular choices, including the Unadjusted Langevin Algorithm (ULA) (Neal, 1992; Roberts & Tweedie, 1996) and Underdamped Langevin Dynamics (ULD) (Cheng et al., 2018; Ma et al., 2021; Mou et al., 2021). These algorithms have been extensively investigated both theoretically and empirically. Notably, Langevin-based algorithms are usually biased: the stationary distributions of ULA and ULD (which are also Markov processes) differ from the target distribution $p^*$, with an error governed by the discretization step size. The Metropolis-adjusted Langevin Algorithm (MALA) (Roberts & Stramer, 2002; Xifara et al., 2014) was designed to resolve this issue.

To achieve unbiased sampling, the proximal sampler, analogous to proximal point methods in convex optimization, was recently developed in Lee et al. (2021). In particular, the core idea of the proximal sampler is to first construct a joint distribution

$$p^*(x, y) \propto \exp\left(-f(x) - \|x - y\|^2/(2\eta)\right) \quad (1)$$

whose $x$-marginal distribution is the same as $p^*$. The iterations then alternate between two stages: From a given $x$, sample $y|x \sim p^*(y|x) = N(x, \eta I)$.
From a given $y$, sample $x|y \sim p^*(x|y)$ satisfying $p^*(x|y) \propto \exp\left(-f(x) - \|x - y\|^2/(2\eta)\right)$. Note that the second stage can be easily implemented even in the non-log-concave setting (i.e., when $f(x)$ is nonconvex), since its target distribution $p^*(x|y)$ is strongly log-concave when $\eta$ is properly small. Under this condition, the proximal sampler achieves a linear convergence rate for different criteria (Chen et al., 2022) when the proximal oracle can be accessed.

Despite the impressive performance of proximal samplers in the deterministic setting, where full access to the function $f(x)$ and its gradient $\nabla f(x)$ is available, their behavior remains largely unexplored in the stochastic setting, where we can only access a stochastic estimate of $f(x)$ and $\nabla f(x)$ at each step. This is particularly relevant when the target distribution $p^*$ is formulated as the posterior of a stochastic process based on multiple observations or training data points. In such cases, the negative log-density takes the finite-sum form $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$, where $n$ denotes the number of observations and $f_i(x)$ denotes the corresponding per-example negative log-density.¹ To reduce the high per-step computational cost of calculating the full gradient, the mini-batch stochastic gradient has become a standard choice. In the realm of Langevin-based algorithms, extensive research has been conducted on their stochastic counterparts. Various stochastic gradient Langevin algorithms, including stochastic gradient Langevin dynamics (SGLD) (Welling & Teh, 2011) and stochastic gradient ULD (SG-ULD) (Cheng et al., 2018), have been developed, and their convergence guarantees are well established for both log-concave and non-log-concave target distributions. However, to the best of our knowledge, no prior attempts have been made to study the stochastic gradient proximal sampler, encompassing both algorithm design and theoretical analysis. Consequently, there is a considerable gap in understanding how the proximal sampler can be effectively adapted to the stochastic setting and what convergence rates can be achieved. This unexplored research question impedes the broader application of the proximal sampler in various tasks, hindering its full potential.

In this paper, we aim to answer this question systematically by providing a comprehensive study of the stochastic gradient proximal sampler. First, we provide a framework for implementing stochastic proximal samplers: the idea is to replace the original joint target distribution with a randomized one,
$$p^*(x, y|b) \propto \exp\left(-f_b(x) - \|x - y\|^2/(2\eta)\right),$$
where $b$ is the stochastic mini-batch that is randomly sampled in different iterations. The two-stage alternating sampling process, drawing $y$ from $p^*(y|x, b)$ (a Gaussian-type distribution) and $x$ from $p^*(x|y, b)$ (a log-concave distribution), is then performed accordingly. By applying different numerical samplers for $p^*(x|y, b)$, we can design various stochastic proximal samplers. Then, we develop the theory to characterize the convergence of stochastic proximal samplers. The core of our analysis is to sharply quantify the error propagation across multiple iterations. In particular, the sampling error within one step stems from (1) the inexact target $p^*(x|y, b)$ caused by the stochastic mini-batch;

¹ We consider the average for consistency with Raginsky et al. (2017); Zou et al. (2021).
(2) the inexact sampling from $p^*(x|y, b)$ caused by numerical samplers. By designing a proper initialization when sampling from $p^*(x|y, b)$, the error propagation can be controlled by the second moment of the particles' underlying distributions, rather than requiring stationary points of $f$ as in the previous analysis (Altschuler & Chewi, 2023). When $p^*$ only satisfies LSI, its negative log-density $f$ may even be nonconvex, and finding an $\epsilon$-approximate stationary point of a nonconvex function requires $O(\epsilon^{-4})$ oracle calls with stochastic gradient descent, which is unacceptable in sampling tasks. Besides, by controlling the second moment bound, we bound the expected gradient complexity of convergence, which is stronger than the high-probability guarantee shown in Altschuler & Chewi (2023). Based on our theory, we develop convergence guarantees for a variety of stochastic proximal samplers when the target distribution is log-smooth and satisfies a Log-Sobolev Inequality (LSI). We summarize the main contributions of this paper as follows:

- We propose a framework for implementing stochastic proximal samplers, together with a general theory characterizing their convergence for a general class of (possibly non-log-concave) target distributions. We show that with feasible choices of the mini-batch size and learning rate, stochastic proximal samplers provably converge to the target distribution with a small total variation (TV) distance. Notably, compared with Altschuler & Chewi (2023), our framework is more practical since it does not require stationary-point information of $f$ and replaces high-probability convergence results with expectation ones.

- Based on the developed framework, we consider two implementations of stochastic proximal samplers, using SGLD and warm-started MALA for sampling $p^*(x|y, b)$, giving rise to the SPS-SGLD and SPS-MALA algorithms. We prove that in order to achieve $\epsilon$ sampling error in TV distance, the gradient complexities of SPS-SGLD and SPS-MALA are $O(d\epsilon^{-2})$ and $O(d^{1/2}\epsilon^{-2})$, respectively. Compared with the state-of-the-art $O(d^{4/3}\epsilon^{-2})$ result achieved by CC-SGLD (Das et al., 2023), the developed stochastic proximal samplers are faster by at least an $O(d^{1/3})$ factor.

- We conduct experiments comparing SGLD with SPS-SGLD, where the latter is implemented by using SGLD to sample $p^*(x|y, b)$ within the stochastic proximal sampler framework. Empirical results show that SPS-SGLD consistently achieves better sampling performance than vanilla SGLD across various problem dimensions.

2. Related Work

This section introduces related work by dividing current gradient-based MCMC methods into two categories. The first is based on discretizing the continuous Langevin dynamics. For the second type, which includes proximal samplers, the underlying SDE of the particles varies considerably. Beyond sampling algorithms, we also discuss the use of the proximal operator in optimization and how it relates to sampling.

Stochastic Gradient Langevin-based Algorithms. The first Langevin-based MCMC implemented with stochastic gradient oracles is stochastic gradient Langevin dynamics (SGLD) (Welling & Teh, 2011). Dalalyan & Karagulyan (2019) further establish the convergence guarantee of SGLD in Wasserstein-2 distance for strongly log-concave targets. Besides, Durmus et al. (2019) analyze SGLD from a composite optimization perspective and obtain convergence in KL divergence.
To adapt SGLD to a broader class of target distributions beyond log-concavity, Raginsky et al. (2017) and Xu et al. (2018) extend the theoretical analysis of SGLD to distributions satisfying dissipativity conditions and prove convergence when using large mini-batch sizes. This result is further improved by Zou et al. (2021), which establishes a convergence guarantee of SGLD for sampling non-log-concave distributions with arbitrary mini-batch sizes. More recently, Das et al. (2023) develop non-asymptotic Central Limit Theorems to quantify the approximate Gaussianity of the noise introduced by the random-batch stochastic approximations used in SGLD and its variants, which leads to the best known convergence rates, i.e., $O(d^{3/2}\epsilon^{-2})$ and $O(d^{4/3}\epsilon^{-2})$, for distributions satisfying isoperimetric conditions.

Non-Langevin-based Algorithms. A number of sampling algorithms are designed based on Markov processes other than Langevin dynamics. To name a few, Hamiltonian Monte Carlo (HMC) (Neal, 2010) simulates particle trajectories in a Hamiltonian system; diffusion-based MCMCs (Huang et al., 2023; 2024) discretize the reverse process of an Ornstein-Uhlenbeck process initialized at $p^*$; proximal samplers alternately sample the marginals of a joint distribution; and Dong et al. (2022) focus on ODE-based sampling. In theory, the convergence rate of HMC has been established in Bou-Rabee et al. (2020); Mangoubi & Smith (2017); Mangoubi & Vishnoi (2018); Lee et al. (2018); Chen & Vempala (2022); Durmus et al. (2017); Chen et al. (2020), achieving smaller sampling error than ULA for both strongly log-concave and non-log-concave targets. Chen et al. (2014); Zou & Gu (2021) further develop a class of stochastic gradient HMC methods and prove convergence rates in the strongly log-concave setting. The convergence rates of diffusion-based MCMCs are studied in Huang et al. (2023; 2024); these methods are demonstrably faster than ULA and apply to more general settings (e.g., beyond isoperimetry). For the proximal sampler, Lee et al. (2021); Chen et al. (2022) provide linear convergence rates for different criteria under strongly log-concave or isoperimetric conditions when the exact proximal oracle exists. Liang & Chen (2022); Altschuler & Chewi (2023); Fan et al. (2023) further extend the convergence results to certain inexact proximal oracles. Notably, existing theory for non-Langevin-based algorithms is mostly developed in the deterministic setting, while algorithmic implementation and theoretical analysis in the stochastic setting remain largely understudied, especially when the target distribution is non-log-concave. Our paper provides the first attempt to study the proximal sampler's theoretical and empirical behavior with only stochastic gradient oracles, which paves the way for exploring other non-Langevin-based algorithms in the stochastic setting.

Applications of the Proximal Operator. Before being applied to sampling algorithms, the proximal operator was introduced in optimization via the proximal point method (Lemaréchal, 1978; 2009; Liang & Monteiro, 2021; 2023; Mifflin, 1982; Rockafellar, 1976; Wolfe, 2009). The proximal point method for minimizing an objective function $f$ iterates the proximal mapping
$$\mathrm{prox}_{\eta f}(y) := \arg\min_{x \in \mathbb{R}^d}\left\{f(x) + \|x - y\|^2/(2\eta)\right\}$$
with a proper choice of $\eta$.
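To make this connection concrete, the following is a minimal sketch (not the paper's code) of the proximal mapping computed by inner gradient descent and iterated as the proximal point method; the function names, step sizes, and iteration counts are illustrative assumptions.

```python
import numpy as np

def prox(f_grad, y, eta, steps=200, lr=None):
    """Approximate prox_{eta f}(y) = argmin_x { f(x) + ||x - y||^2 / (2*eta) }
    by plain gradient descent on the regularized objective (a sketch; any
    smooth unconstrained minimizer would do)."""
    x = np.asarray(y, dtype=float).copy()
    lr = eta / 2.0 if lr is None else lr       # heuristic step size, assumes eta < 1/L
    for _ in range(steps):
        x -= lr * (f_grad(x) + (x - y) / eta)  # gradient of the prox objective
    return x

def proximal_point_method(f_grad, x0, eta, iters=50):
    """Proximal point iteration: x_{k+1} = prox_{eta f}(x_k)."""
    x = x0
    for _ in range(iters):
        x = prox(f_grad, x, eta)
    return x

# Example: f(x) = 0.5*||x||^2 has the closed form prox_{eta f}(y) = y / (1 + eta),
# so the iteration converges to the minimizer 0.
x_min = proximal_point_method(lambda x: x, np.ones(3), eta=1.0)
```

The regularization term $\|x - y\|^2/(2\eta)$ makes each inner problem strongly convex even when $f$ is not, which is exactly the property the proximal sampler exploits in its second stage.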
Using the correspondence between $f$ and $\exp(-f)$ in optimization and sampling, the proximal sampler can be viewed as the sampling counterpart of the proximal point method in optimization (Rockafellar, 1976).

3. Proposed Framework

This section first introduces the notation commonly used in the following sections. Then, we specify the assumptions on the target distribution $p^*$ required by our algorithms and analysis. After that, we present the proposed framework and some of its fundamental properties, such as the control of error propagation when sampling from an inexact conditional density $p^*(x|y)$.

3.1. Notations and Assumptions

We suppose the target distribution satisfies $p^* \propto \exp(-f)$ with a finite-sum negative log-density, which means
$$f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x), \quad \text{where } \forall i,\ f_i: \mathbb{R}^d \to \mathbb{R}. \quad (2)$$
We use letters, e.g., $x$ and $\mathbf{x}$, to denote vectors and random vectors in $\mathbb{R}^d$, except for the letters $b$ and $\mathbf{b}$, which denote sets and randomized sets. The function $f_b$ denotes the energy function induced by the mini-batch $b$, i.e.,
$$f_b(x) = \frac{1}{|b|}\sum_{i \in b} f_i(x), \quad \text{where } b \subseteq \{1, 2, \ldots, n\}, \quad (3)$$
and $\nabla f_b$ is the corresponding mini-batch gradient. The notation $|\cdot|$ denotes the $L_1$ norm when applied to a vector and the number of elements when applied to a set. The Euclidean norm (of a vector) and its induced norm (of a matrix) are denoted by $\|\cdot\|$. For distributions $p$ and $q$, we use $\mathrm{TV}(p, q)$ and $\mathrm{KL}(p\|q)$ to denote their TV distance and KL divergence, respectively.

| Results | Algorithm | Assumptions | Metric | Complexity |
|---|---|---|---|---|
| Raginsky et al. (2017) | SGLD | Dissipative, Component Smooth | $W_2$ | $O(\mathrm{poly}(d)\epsilon^{-4})$ |
| Zou et al. (2021) | SGLD | Dissipative, Warm Start, Component Smooth | TV | $O(d^4\epsilon^{-2})$ |
| Das et al. (2023) | AB-SGLD | LSI, Finite-Sum, Smooth | TV | $O(d^{3/2}\epsilon^{-2})$ |
| Das et al. (2023) | CC-SGLD | LSI, 6th moment, Smooth | TV | $O(d^{4/3}\epsilon^{-2})$ |
| Theorem 4.1 | SPS-SGLD | LSI, Finite-Sum, Component Smooth | TV | $O(d\epsilon^{-2})$ |
| Theorem 4.2 | SPS-MALA | LSI, Finite-Sum, Component Smooth | TV | $O(d^{1/2}\epsilon^{-2})$ |

Table 1. Comparison with prior works for SGLD. $d$ and $\epsilon$ denote the dimension and error tolerance. Note that we do not list the assumptions about the stochastic gradient since they vary greatly across references; they will be discussed in our detailed theorems. Our results are stated under [A3] with $\sigma^2 = \Theta(1)$. Compared with the state-of-the-art result, sampling methods based on the stochastic proximal sampler improve the convergence rate by at least an $O(d^{1/3})$ factor.

We now state the assumptions required on $p^*$:

[A1] (Component Smoothness) For any $i \in \{1, 2, \ldots, n\}$, the function $f_i$ is $L$-smooth, which means $\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x - y\|$.

[A2] (Log-Sobolev Inequality) The target distribution $p^*$ satisfies the inequality
$$\mathbb{E}_{p^*}\left[g^2\log g^2\right] - \mathbb{E}_{p^*}\left[g^2\right]\log\mathbb{E}_{p^*}\left[g^2\right] \le \frac{2}{\alpha^*}\,\mathbb{E}_{p^*}\left[\|\nabla g\|^2\right]$$
with a constant $\alpha^*$ for all smooth functions $g: \mathbb{R}^d \to \mathbb{R}$ satisfying $\mathbb{E}_{p^*}[g^2] < \infty$.

[A3] (Bounded Variance) For any $x \in \mathbb{R}^d$, the variance of stochastic gradients is bounded, i.e.,
$$\frac{1}{n}\sum_{i=1}^n \left\|\nabla f_i(x) - \nabla f(x)\right\|^2 \le \sigma^2.$$

The component smoothness of the finite-sum loss, i.e., [A1], is also required in Raginsky et al. (2017); Zou et al. (2021). [A2] is a kind of isoperimetric condition (Vempala & Wibisono, 2019) which is strictly weaker than the strongly log-concave assumption and even the dissipativity assumption (Raginsky et al., 2017). Besides, it implies that the target distribution $p^*$ has a finite second moment $M$ satisfying $M = O(d)$, as demonstrated in Appendix A. [A3] recovers the standard uniformly bounded variance assumption, i.e., $\sigma = \Theta(1)$, following Nemirovski et al.
(2009); Ghadimi & Lan (2012; 2013), while sampling references sometimes allow $\sigma^2 = \Theta(d)$, e.g., Raginsky et al. (2017); Dalalyan & Karagulyan (2019); Das et al. (2023). Both cases will be considered in our analysis.

Algorithm 1 Stochastic Proximal Sampler
1: Input: the negative log-density $f$ of the target distribution; the initial particle $\hat{x}_0$ drawn from $p_0$;
2: for $k = 0$ to $K - 1$ do
3:   Sample $\hat{x}_{k+1/2}$ from $\hat{p}_{k+1/2|k}(\cdot|\hat{x}_k)$;
4:   Draw the mini-batch $b_k$ from $\{1, 2, \ldots, n\}$;
5:   Sample $\hat{x}_{k+1}$ from $\hat{p}_{k+1|k+1/2,b}(\cdot|\hat{x}_{k+1/2}, b_k)$;
6: end for
7: Return: $\hat{x}_K$.

3.2. Stochastic Proximal Sampler

The stochastic proximal sampler (SPS) framework is shown in Alg. 1. With the notation introduced in Section 3.1, we now explain the kernels $\hat{p}_{k+1/2|k}(\cdot|x_k)$ and $\hat{p}_{k+1|k+1/2,b}(\cdot|x_{k+1/2}, b_k)$, which parallel those of standard proximal samplers. Consider a joint target distribution
$$p^*(x, y) \propto \exp\left(-f_b(x) - \|x - y\|^2/(2\eta)\right)$$
defined by the randomized mini-batch $b$ and the outer-loop step size $\eta$; Alg. 1 then samples from $p^*(y|x)$ and $p^*(x|y)$ alternately. Specifically, at iteration $k$, with $x = x_k$, $y = x_{k+1/2}$ and $\eta = \eta_k$, the conditional probability density $p^*(x_{k+1/2}|x_k)$ is equivalent to
$$p_{k+1/2|k}(x'|x) \propto \exp\left(-\|x' - x\|^2/(2\eta_k)\right), \quad (5)$$
which can be implemented exactly in Line 3 of Alg. 1 due to its Gaussianity. Besides, with $x' = x_{k+1}$ and $x = x_{k+1/2}$, the transition kernel $p^*(x_{k+1}|x_{k+1/2})$ can be reformulated as
$$p_{k+1|k+1/2,b}(x'|x, b) \propto \exp\left(-f_b(x') - \|x' - x\|^2/(2\eta_k)\right), \quad (6)$$
which is to be implemented in Line 5 of Alg. 1. Rather than exactly sampling from a target distribution, e.g., $p_{k+1|k+1/2,b}(x'|x, b)$, most samplers can only generate approximate samples that are close to the target in practice. Therefore, we consider a Markov process $\{\hat{x}_k\}$ whose underlying distribution is denoted by $\hat{p}_k$. Given the same initialization $\hat{p}_0 = p_0$, we define the two empirical transition kernels as $\hat{p}_{k+1/2|k} := p_{k+1/2|k}$ and $\hat{p}_{k+1|k+1/2,b}(\cdot|x, b)$ satisfying
$$\mathrm{KL}\left(\hat{p}_{k+1|k+1/2,b}(\cdot|x, b)\,\big\|\,p_{k+1|k+1/2,b}(\cdot|x, b)\right) \le \delta_k. \quad (7)$$
Here we assume that the conditional distribution of $\hat{x}_{k+1}$ given $\hat{x}_{k+1/2}$ is close to the ideal conditional distribution $p_{k+1|k+1/2,b}(x'|x, b)$ up to $\delta_k$ approximation error in KL divergence. In fact, since the distribution $p_{k+1|k+1/2,b}(x'|x, b)$ is strongly log-concave when $\eta_k$ is properly chosen, the condition in Eq. 7 can be achieved by applying standard numerical samplers such as SGLD and MALA with provable guarantees (detailed implementations are discussed in the next section). The following theorem characterizes the error propagation across multiple steps and provides a general bound on the sampling error achieved by Alg. 1.

Theorem 3.1. Suppose Assumptions [A1]-[A3] hold, and Alg. 1 satisfies: (i) $\eta_k \le \frac{1}{2L}$ for all $k \in \{0, 1, \ldots, K-1\}$; (ii) the initial particle $\hat{x}_0$ is drawn from the standard Gaussian distribution on $\mathbb{R}^d$; (iii) Line 5 is implemented by some specific inner sampler achieving
$$\mathrm{KL}\left(\hat{p}_{k+1|k+1/2,b}(\cdot|x, b)\,\big\|\,p_{k+1|k+1/2,b}(\cdot|x, b)\right) \le \delta_k$$
for all $k \in \{0, 1, \ldots, K-1\}$. Then, we have
$$\mathrm{TV}\left(\hat{p}_K, p^*\right) \le \sqrt{\frac{1}{2}\sum_{i=0}^{K-1}\delta_i} + \sigma\sqrt{\sum_{i=0}^{K-1}\frac{\eta_i}{2|b_i|}} + \sqrt{\frac{1}{2}\mathrm{KL}\left(p_0\,\|\,p^*\right)}\cdot\prod_{i=0}^{K-1}(1 + \alpha^*\eta_i)^{-1}. \quad (8)$$

Theorem 3.1 provides a general upper bound on the TV distance between the underlying distribution of the particles returned by Alg. 1 and the target distribution $p^*$. The first term in Eq. 8 represents the accumulated error of the inexact sampling from $p_{k+1|k+1/2,b}(\cdot|x, b)$, i.e., Line 5 of Alg. 1. The second term represents the approximation error of using stochastic gradients, and the last term represents the error of the deterministic proximal sampler.
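To illustrate the two-stage structure of Alg. 1, here is a minimal Python sketch of the outer loop; `inner_sampler` stands for any approximate restricted-Gaussian-oracle routine satisfying Eq. 7 (e.g., the SGLD or MALA inner samplers of Section 4), and all names and defaults are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def stochastic_proximal_sampler(n, x0, eta, K, batch_size, inner_sampler,
                                rng=np.random.default_rng(0)):
    """Sketch of Alg. 1. inner_sampler(x_half, batch, eta) should approximately
    sample from p(x') ~ exp(-f_b(x') - ||x' - x_half||^2 / (2*eta)),
    i.e., satisfy the KL condition of Eq. 7."""
    x = np.asarray(x0, dtype=float)
    for _ in range(K):
        # Line 3: the first stage is an exact Gaussian draw, x_{k+1/2} ~ N(x_k, eta*I).
        x_half = x + np.sqrt(eta) * rng.standard_normal(x.shape)
        # Line 4: draw a fresh outer mini-batch b_k of component indices.
        batch = rng.choice(n, size=batch_size, replace=False)
        # Line 5: approximate restricted Gaussian oracle via the inner sampler.
        x = inner_sampler(x_half, batch, eta)
    return x
```

Any inner routine meeting the KL condition of Eq. 7 can be plugged in unchanged, which is exactly the modularity that Section 4 exploits.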
To achieve an $\epsilon$-TV distance to the target distribution $p^*$, one has to choose a small error tolerance for the inexact sampling, i.e., $\delta_k = \Theta(\epsilon^2/K)$, to control the first term of Eq. 8. Besides, it still requires a large enough mini-batch size, i.e., $|b_i| = \Theta(\sigma^2\epsilon^{-2})$, and a long enough mixing time, i.e., $\sum_{i=0}^{K-1}\eta_i = \Theta(\log(1/\epsilon))$, to make the last two terms of Eq. 8 small, respectively. Notably, the implementation of the proximal sampler in Altschuler & Chewi (2023) also allows inexact sampling from $p_{k+1|k+1/2,b}(\cdot|x, b)$ in the second-stage update, and requires the distribution of the returned particles, i.e., $\hat{p}_{k+1|k+1/2,b}(\cdot|x, b)$, to satisfy Eq. 7 with a small $\delta_k$. However, they only consider the deterministic setting, i.e., $b = \{1, 2, \ldots, n\}$, and require initializing Line 5 of Alg. 1 at certain stationary points $x^*$ of $f$. Hence, directly applying their analysis may require finding stationary points in each iteration as the function $f_b$ changes, which may take substantially more time. This is because, when $p^*$ only satisfies LSI, the function $f_b$ may be nonconvex, and finding an $\epsilon$-approximate stationary point of a general nonconvex function requires $O(\epsilon^{-4})$ stochastic gradient calls (Nesterov, 2013), which is unacceptable in sampling algorithms. Therefore, the implementation of Altschuler & Chewi (2023) remains a concern without exact, or even inexact, information about the stationary points of $f$. In our analysis, combining a proper Langevin-based MCMC with a Gaussian-type initialization centered at $\hat{x}_{k+1/2}$, the gradient complexity for achieving Eq. 7 depends only on $\log\|\hat{x}_{k+1/2}\|^2$ rather than on stationary points $x^*$, which will be shown explicitly in the next section. To bound the expected gradient complexity, we need to characterize $\mathbb{E}_{\hat{p}_{k+1/2}}\left[\log\|\hat{x}_{k+1/2}\|^2\right]$, which can be readily upper bounded by $\log\mathbb{E}_{\hat{p}_{k+1/2}}\left[\|\hat{x}_{k+1/2}\|^2\right]$. This implies that we further need to control the second moment of the particles, which is done in the following lemma.

Lemma 3.2. Suppose Assumptions [A1]-[A3] hold, and let $M_k$ denote the second moment of the underlying distribution of $\hat{x}_k$. Then we have
$$M_{k+1} \le 24 M_k + 4\eta_k\delta_k + 24\eta_k^2\sigma^2/|b| + 28 M + 24\eta_k d.$$

This bound may seem large, as $M_k$ exhibits an exponential growth rate. However, we remark that only $\log(M_k)$, rather than $M_k$ itself, appears in our calculation of the gradient complexity. Letting $K$ be the total number of steps, which can be chosen as $O(L/\alpha^*)$, $M_K$ is controlled by $\exp(K)$, so $\log(M_k)$ is controlled by $K = O(L/\alpha^*)$, which does not heavily affect the total gradient complexity.

4. Implementations of SPS

This section focuses on detailed implementations of SPS. Since the target $\hat{p}_{k+1/2|k}$ of Line 3 of Alg. 1 is the Gaussian-type distribution shown in Eq. 5, we can obtain that sample exactly.

Algorithm 2 Inner Stochastic Gradient Langevin Dynamics: InnerSGLD($x_0$, $b$, $\eta$, $\delta$)
1: Input: the output particle $x_0$ of Alg. 1 Line 3, the selected mini-batch $b$, the outer-loop step size $\eta$, the required inner-loop accuracy $\delta$;
2: Initialize the returned particle $\bar{z} = 0$;
3: Draw the initial particle $z_0$ from $N(x_0, \eta I)$;
4: for $s = 0$ to $S - 1$ do
5:   Draw the mini-batch $b_s$ from $b$;
6:   Update the particle $z'_s \leftarrow z_s + \sqrt{2\tau_s}\,\xi$, where $\xi \sim N(0, I)$;
7:   Update the particle $z_{s+1} \leftarrow z'_s - \tau_s\left(\nabla f_{b_s}(z'_s) + \eta^{-1}(z'_s - x_0)\right)$;
8:   if $s > S'$ then
9:     Update the returned particle: $\bar{z} \leftarrow \bar{z} + z'_s/(S - S' + 1)$;
10:  end if
11: end for
12: Return: $\bar{z}$.
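The following is a minimal Python sketch of Alg. 2, under the assumption that `grad_f(z, idx)` returns the mini-batch gradient $\frac{1}{|idx|}\sum_{i\in idx}\nabla f_i(z)$; the burn-in handling and trajectory averaging mirror Lines 3-11, and all names and the single-sample inner batches (an illustrative choice from Theorem 4.1) are assumptions.

```python
import numpy as np

def inner_sgld(grad_f, x0, batch, eta, tau_schedule, burn_in,
               rng=np.random.default_rng(0)):
    """Sketch of Alg. 2 (InnerSGLD), targeting
    p(z) ~ exp(-f_b(z) - ||z - x0||^2 / (2*eta))."""
    S = len(tau_schedule)
    z = x0 + np.sqrt(eta) * rng.standard_normal(x0.shape)   # Line 3: z0 ~ N(x0, eta*I)
    z_bar = np.zeros_like(x0)
    denom = max(S - burn_in - 1, 1)                          # number of averaged iterates
    for s, tau in enumerate(tau_schedule):
        idx = rng.choice(batch, size=1)                      # Line 5: inner batch, |b_s| = 1
        z_mid = z + np.sqrt(2.0 * tau) * rng.standard_normal(z.shape)  # Line 6: diffusion
        z = z_mid - tau * (grad_f(z_mid, idx) + (z_mid - x0) / eta)    # Line 7: drift
        if s > burn_in:                                      # Lines 8-10: average tail iterates
            z_bar += z_mid / denom
    return z_bar
```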
Then, the key step is to numerically sample from the distribution $p_{k+1|k+1/2,b}$ so that the distribution of the approximate samples, i.e., $\hat{p}_{k+1|k+1/2,b}$, satisfies Eq. 7. In particular, we implement this step, i.e., Line 5 of Alg. 1, using two inner samplers: stochastic gradient Langevin dynamics (SGLD) and the warm-started Metropolis-adjusted Langevin algorithm (MALA), which give rise to two stochastic proximal sampling algorithms. In what follows, we introduce the implementation details of these two algorithms and prove their gradient complexities, i.e., the number of stochastic gradient calls needed to guarantee $\epsilon$ sampling error.

4.1. SGLD Inner Sampler

We consider implementing Line 5 of Alg. 1 with the SGLD inner sampler shown in Alg. 2, and name the resulting method SPS-SGLD. We point out that the particle update of Alg. 2 differs slightly from the standard SGLD update. In particular, our update is performed in two steps and returns a trajectory average, computed over the last $S - S'$ iterations, rather than a single particle. The first step of the update, i.e., Line 6 of Alg. 2, performs the diffusion via a Gaussian step, and the second step, i.e., Line 7 of Alg. 2, updates the particle via the drift term $\nabla\log\hat{p}_{k+1|k+1/2,b}$. With this implementation, we obtain the following gradient complexity for approaching the target $p^*$.

Theorem 4.1. Suppose [A1]-[A3] hold. Set the parameters of Alg. 1 as
$$\eta_k = \Theta(L^{-1}), \quad K = \Theta(\kappa), \quad \delta_k = \Theta(\kappa^{-1}\epsilon^2), \quad b_o = \min\left\{\Theta(\alpha^{*-1}\sigma^2\epsilon^{-2}),\, n\right\},$$
where $\kappa = L/\alpha^*$. If we choose Alg. 2 as the inner sampler in Line 5 of Alg. 1 and set
$$\tau = \min\left\{\Theta\left(\kappa^{-1}\epsilon^2(d + \sigma^2)^{-1}\right), 1\right\}, \quad \tau' = \min\left\{\Theta(L^{-1}\tau), 1\right\}, \quad S' = \Theta(L^{-1}\tau^{-1}), \quad S = \Theta\left(S' + (\tau')^{-1}\right),$$
with $\tau_s = \tau$ for $s \in [0, S']$ and $\tau_s = \tau'$ for $s \in [S' + 1, S - 1]$, and inner mini-batch sizes $|b_s| = 1$ for all $s \in \{0, 1, \ldots, S-1\}$, then the distribution $\hat{p}_K$ of the particles returned by Alg. 1 satisfies $\mathrm{TV}(\hat{p}_K, p^*) < 3\epsilon$. In this condition, the expected gradient complexity is $\Theta\left(\kappa^3(d + \sigma^2)\epsilon^{-2}\right)$.

Due to space limitations, we only present an informal statement here; the formal version is deferred to Theorem C.4 in Appendix C.1. Theorem 4.1 provides an $O(d\epsilon^{-2})$ gradient complexity regardless of whether $\sigma^2 = \Theta(d)$ or $\sigma^2 = \Theta(1)$. When $\sigma^2 = \Theta(d)$, the state-of-the-art results are $O(d^{3/2}\epsilon^{-2})$ and $O(d^{4/3}\epsilon^{-2})$ under stronger variance assumptions (Das et al., 2023). Compared with those results, our SPS-SGLD is faster by at least an $O(d^{1/3})$ factor under strictly weaker assumptions. When $\sigma^2 = \Theta(1)$, the gradient complexity provided in Das et al. (2023) becomes $O(d\epsilon^{-2})$, which matches our result. Notably, in the proof of Theorem C.4, we demonstrate that, with the Gaussian-type initialization in Line 3 of Alg. 2, the relative Fisher information gap between the underlying distribution of $z_0$ and the target distribution $\hat{p}_{k+1|k+1/2,b}$ can be upper bounded by a factor $\log\left(\|x_0\|^2 + \|\nabla f_b(0)\|^2\right)$, which is independent of the stationary points of $f$ and can be controlled, in expectation, by the second moment via Lemma 3.2 and the variance of the stochastic gradients. This means that SPS-SGLD can be easily implemented without the initialization issues of previous work, e.g., Altschuler & Chewi (2023).

4.2. Warm-started MALA Inner Sampler

We consider implementing Line 5 of Alg. 1 with the warm-started MALA inner sampler shown in Alg.
3, and name the resulting method SPS-MALA, where the functions $g(z)$ and $\varphi(z'; z, \tau)$ are defined as follows:
$$g(z) := -\log p_{k+1|k+1/2,b}(z|x_0, b) = f_b(z) + \frac{\|z - x_0\|^2}{2\eta} + \mathrm{const}, \qquad \varphi(z'; z, \tau) := \frac{\left\|z' - (z - \tau\nabla g(z))\right\|^2}{4\tau}.$$

Algorithm 3 Inner Metropolis-adjusted Langevin algorithm: InnerMALA($x_0$, $b$, $\eta$, $\delta$)
1: Input: the output particle $x_0$ of Alg. 1 Line 3, the selected mini-batch $b$, the outer-loop step size $\eta$, the required inner-loop accuracy $\delta$;
2: Draw the initial sample $z_0$ from InnerULD($x_0$, $b$, $\eta$) in Alg. 4;
3: for $s = 0$ to $S - 1$ do
4:   Draw $z'_s$ from $N\left(z_s - \tau_s\nabla g(z_s), 2\tau_s I\right)$;
5:   Define the threshold $\hat{p}$ to be
$$\hat{p} := \min\left\{1, \frac{\exp\left(g(z_s) + \varphi(z'_s; z_s, \tau_s)\right)}{\exp\left(g(z'_s) + \varphi(z_s; z'_s, \tau_s)\right)}\right\};$$
6:   Draw the sample $p$ uniformly from $[0, 1]$;
7:   if $p \le \hat{p}$ then
8:     Update the particle $z_{s+1} \leftarrow z'_s$;
9:   else
10:    Update the particle $z_{s+1} \leftarrow z_s$;
11:  end if
12: end for
13: Return: $z_S$.

Inspired by Altschuler & Chewi (2023), SPS-MALA requires InnerULD to provide warm starts, i.e., Line 2 of Alg. 3; we defer the implementation of InnerULD to Appendix A. Compared with a generic initialization, the gradient complexity of MALA improves from $O(d)$ to $O(d^{1/2})$ with warm starts, and ULD can provide such warm starts within $O(d^{1/2})$ gradient complexity. This means InnerMALA is faster than InnerSGLD by an $O(d^{1/2})$ factor in achieving the KL convergence condition of Eq. 7. Hence, SPS-MALA can be expected to improve the dimension dependence of SPS-SGLD. With this implementation, i.e., Alg. 3, the TV-distance convergence of Alg. 1 is as follows.

Theorem 4.2. Suppose [A1]-[A3] hold. Set the parameters of Alg. 1 as
$$\eta_k = \Theta(L^{-1}), \quad K = \Theta(\kappa), \quad \delta_k = \Theta(\kappa^{-1}\epsilon^2), \quad b_o = \min\left\{\Theta(\alpha^{*-1}\sigma^2\epsilon^{-2}),\, n\right\},$$
where $\kappa = L/\alpha^*$. If we choose Alg. 3 as the inner sampler in Line 5 of Alg. 1 and set $\gamma = \Theta(L^{1/2})$, $\tau = \Theta(L^{-1/2}d^{-1/2})$, and $S = \Theta(d^{1/2})$ for Alg. 4, and $\tau = \Theta(L^{-1}d^{-1/2})$ and $S = \Theta(d^{1/2})$ for Alg. 3, then the underlying distribution $\hat{p}_K$ of the particles returned by Alg. 1 satisfies $\mathrm{TV}(\hat{p}_K, p^*) < 3\epsilon$. In this condition, the expected gradient complexity is $\Theta\left(\kappa^3 d^{1/2}\sigma^2\epsilon^{-2}\right)$.

Due to space limitations, we only present an informal statement here; the formal version is deferred to Theorem C.8 in Appendix C.2. Theorem 4.2 provides gradient complexities of $O(d^{1/2}\epsilon^{-2})$ and $O(d^{3/2}\epsilon^{-2})$ for the cases $\sigma^2 = \Theta(1)$ and $\sigma^2 = \Theta(d)$, respectively. When $\sigma^2 = \Theta(1)$, the state-of-the-art result is $O(d\epsilon^{-2})$ under the linear-growth assumption (Das et al., 2023). Compared with that result, our SPS-MALA is faster by an $O(d^{1/2})$ factor under strictly weaker assumptions. However, the efficiency of SPS-MALA is strongly affected by the variance, i.e., $\sigma^2$ in [A3], through the mini-batch size of Alg. 1. When $\sigma^2 = \Theta(d)$, the complexity of SPS-MALA becomes $O(d^{3/2}\epsilon^{-2})$, which matches AB-SGLD in Table 1 but under weaker assumptions. Besides, it should be noted that Altschuler & Chewi (2023) and Fan et al. (2023) provide high-probability convergence in TV distance with an $O(n\kappa d^{1/2})$ gradient complexity, while requiring stationary points of $f$. Compared with this result, we incur an additional $O(\kappa)$ factor besides replacing the number of training data $n$ with the $\Theta(\alpha^{*-1}\sigma^2\epsilon^{-2})$ batch size. This factor comes from our proof technique for removing the dependence on stationary points in the SPS framework by upper bounding second moments throughout Alg. 1, as discussed in Section 4.1.
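For completeness, a minimal sketch of the accept/reject loop of Alg. 3 is given below; `g` and `grad_g` are the inner potential and its gradient as defined above, `z0` plays the role of the InnerULD warm start, and all other names and defaults are illustrative assumptions.

```python
import numpy as np

def inner_mala(g, grad_g, z0, tau, S, rng=np.random.default_rng(0)):
    """Sketch of Alg. 3 (InnerMALA) with potential
    g(z) = f_b(z) + ||z - x0||^2 / (2*eta)."""
    def log_phi(z_to, z_from):
        # Log-density (up to constants) of the proposal N(z_from - tau*grad_g(z_from), 2*tau*I).
        diff = z_to - (z_from - tau * grad_g(z_from))
        return -np.dot(diff, diff) / (4.0 * tau)

    z = z0
    for _ in range(S):
        prop = z - tau * grad_g(z) + np.sqrt(2.0 * tau) * rng.standard_normal(z.shape)
        # Metropolis-Hastings log-acceptance ratio for the Langevin proposal,
        # matching the threshold of Line 5 in Alg. 3.
        log_acc = (g(z) + log_phi(prop, z)) - (g(prop) + log_phi(z, prop))
        if np.log(rng.uniform()) <= min(0.0, log_acc):
            z = prop  # accept; otherwise stay at z
    return z
```

The filter makes the chain exactly invariant for the inner target, so the only remaining inner error is the finite mixing time, which the warm start shortens.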
5. Experiments

In this section, we first describe our experimental settings. Then, for a fair comparison with SGLD, we implement the proximal sampler as SPS-SGLD and compare their sampling performance across different dimensions. More experimental results are deferred to Appendix F.

Experimental Settings. We consider components $f_i$ defined similarly to Zou et al. (2019), i.e.,
$$e^{-f_i(x)} := e^{-\|x - b - \mu_i\|^2/2} + e^{-\|x - b + \mu_i\|^2/2},$$
where the number of input data is $n = 100$, the dimension is $d \in \{10, 20, 30, 40, 50\}$, the bias vector is $b = (3, 3, \ldots, 3)^\top$, and the data inputs satisfy $\sqrt{d/10}\,\mu_i \sim N(\mu, I_{d\times d})$ with $\mu = (2, 2, \ldots, 2)^\top$. Here, we require the inputs to shrink as $d$ grows, which keeps the distances between the different modes of each $f_i$. Since Zou et al. (2019) proved that $f_i$ is dissipative, which implies the LSI property of $e^{-f_i}$ and $e^{-f}$, we omit further discussion of the properties of $f_i$ in this section. For the common hyper-parameter settings of SGLD and SPS-SGLD, we fix the number of stochastic gradient oracles at 12000 and the mini-batch size at 1 per iteration. We enumerate the step size of SGLD and the inner step size of SPS-SGLD from 0.2 to 1.4, and grid-search the inner-loop iteration counts and the outer-loop step sizes over [20, 40, 80] and [1.0, 4.0, 10.0], respectively. Besides, we use the formulation
$$\mathrm{TV}(\hat{p}_K, p^*) := \frac{1}{2d}\sum_{i=1}^d \mathrm{TV}(\hat{p}_{K,i}, p^*_{,i})$$
to estimate the total variation distance between the target distribution and the underlying distribution of the returned particles, where $\hat{p}_{K,i}$ and $p^*_{,i}$ are the marginal distributions of the $i$-th coordinate. For 1d distributions, the densities are approximated by histograms of particles.

[Figure 1: two rows of scatter plots over the projected negative log-likelihood contours, one panel per budget Grad Num = 250, 500, 1000, and 2000.]
Figure 1. The background of each panel is the projection of the negative log-density onto a 2d plane, and the nodes are the projections of the particles returned by the different algorithms onto the same plane. The two rows show the distributions of particle projections after different numbers of iterations of SGLD and SPS-SGLD with their optimal step sizes when $d = 10$.

[Figure 2: TV distance estimation vs. dimension (left) and vs. step size (right) for SGLD and SPS-SGLD.]
Figure 2. The left graph shows the TV distance estimation, i.e., $\mathrm{TV}(\hat{p}_K, p^*)$, when SGLD and SPS-SGLD use their optimal hyper-parameters under different dimensions. The right graph shows the TV distance estimation when SGLD and SPS-SGLD use different step sizes and $d = 10$.
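A minimal sketch of the histogram-based TV estimator described above is shown below; the number of bins and the shared binning grid are illustrative assumptions not specified in the paper.

```python
import numpy as np

def tv_estimate(samples, target_samples, bins=50):
    """Estimate the averaged marginal TV distance: for each coordinate,
    approximate the two 1d densities by histograms on a shared grid, take
    half the L1 difference of the normalized histograms, and average over
    the d coordinates."""
    d = samples.shape[1]
    total = 0.0
    for i in range(d):
        lo = min(samples[:, i].min(), target_samples[:, i].min())
        hi = max(samples[:, i].max(), target_samples[:, i].max())
        p, _ = np.histogram(samples[:, i], bins=bins, range=(lo, hi))
        q, _ = np.histogram(target_samples[:, i], bins=bins, range=(lo, hi))
        p = p / p.sum()
        q = q / q.sum()
        total += 0.5 * np.abs(p - q).sum()  # TV between the two 1d marginals
    return total / d
```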
Experimental Results. We first show the optimal TV distance to the target distribution $p^*$ obtained by SGLD and SPS-SGLD under different dimensions in the left panel of Fig. 2. Since different dimensions correspond to different problems, the sampling error does not necessarily increase with $d$. It can be clearly observed that the optimal TV distance of SPS-SGLD is at least 0.5 smaller than that of SGLD in all our dimension settings, which means SPS-SGLD performs significantly better on this synthetic task. We also investigate how the TV distance changes with the step size for both SPS-SGLD and SGLD, shown in the right panel of Fig. 2. Although the absolute values for the two algorithms differ considerably, their trends are very similar. When the step size is small, both SPS-SGLD and SGLD describe the local landscape of a single mode well. As the step size grows, they gradually cover all modes, whereas SPS-SGLD achieves a lower TV distance since it can cover the modes while preserving the local landscape with a smaller step size. Besides, we show the distributions of particle projections under different numbers of stochastic gradient oracles when $d = 10$ and the optimal step sizes are chosen in Fig. 1. According to the contours of the projected negative log-density of $p^*$, SPS-SGLD covers all modes with a more accurate variance estimate than SGLD, demonstrating from another perspective that SPS-SGLD generates more reasonable samples under different stochastic gradient budgets.

6. Conclusion

This paper provides the first study of adapting stochastic gradient oracles to unbiased samplers for drawing samples from unnormalized non-log-concave target distributions, i.e., $p^* \propto e^{-f}$. Specifically, we provide a framework named stochastic proximal samplers (SPS), which removes the unrealistic requirement of stationary points of $f$ in previous implementations (Altschuler & Chewi, 2023). Furthermore, compared with the biased sampler SGLD and its variants, two implementations of the SPS framework converge to the target distribution $p^*$ with a gradient complexity that is lower by at least an $O(d^{1/3})$ factor, and this improvement is validated by our experiments on synthetic data.

Acknowledgements

We would like to thank the anonymous reviewers and area chairs for their helpful comments. DZ is supported by NSFC 62306252 and the central fund from HKU IDS. YM is supported by the NSF awards SCALE MoDL-2134209 and CCF-2112665 (TILOS), the DARPA AIE program, the U.S. Department of Energy, Office of Science, and CDC-RFA-FT-23-0069 from the CDC's Center for Forecasting and Outbreak Analytics.

Impact Statement

This paper presents work whose goal is to advance the sampling field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Altschuler, J. M. and Chewi, S. Faster high-accuracy log-concave sampling via algorithmic warm starts. arXiv preprint arXiv:2302.10249, 2023.

Bishop, C. M. and Nasrabadi, N. M. Pattern recognition and machine learning, volume 4. Springer, 2006.

Bou-Rabee, N., Eberle, A., and Zimmer, R. Coupling and convergence for Hamiltonian Monte Carlo. 2020.

Chen, T., Fox, E., and Guestrin, C. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pp. 1683-1691. PMLR, 2014.

Chen, Y., Dwivedi, R., Wainwright, M. J., and Yu, B. Fast mixing of metropolized Hamiltonian Monte Carlo: Benefits of multi-step gradients. The Journal of Machine Learning Research, 21(1):3647-3717, 2020.

Chen, Y., Chewi, S., Salim, A., and Wibisono, A. Improved analysis for a proximal algorithm for sampling. In Conference on Learning Theory, pp. 2984-3014. PMLR, 2022.

Chen, Z. and Vempala, S. S. Optimal convergence rate of Hamiltonian Monte Carlo for strongly logconcave distributions. Theory of Computing, 18(1):1-18, 2022.

Cheng, X. and Bartlett, P. Convergence of Langevin MCMC in KL-divergence. In Algorithmic Learning Theory, pp. 186-211. PMLR, 2018.

Cheng, X., Chatterji, N. S., Bartlett, P. L., and Jordan, M. I. Underdamped Langevin MCMC: A non-asymptotic analysis.
In Conference on Learning Theory, pp. 300-323. PMLR, 2018.

Dalalyan, A. S. and Karagulyan, A. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 129(12):5278-5311, 2019.

Das, A., Nagaraj, D. M., and Raj, A. Utilising the CLT structure in stochastic gradient based sampling: Improved analysis and faster algorithms. In The Thirty Sixth Annual Conference on Learning Theory, pp. 4072-4129. PMLR, 2023.

Dong, H., Wang, X., Lin, Y., and Zhang, T. Particle-based variational inference with preconditioned functional gradient flow. arXiv preprint arXiv:2211.13954, 2022.

Durmus, A., Moulines, E., and Saksman, E. On the convergence of Hamiltonian Monte Carlo. arXiv preprint arXiv:1705.00166, 2017.

Durmus, A., Majewski, S., and Miasojedow, B. Analysis of Langevin Monte Carlo via convex optimization. The Journal of Machine Learning Research, 20(1):2666-2711, 2019.

Fan, J., Yuan, B., and Chen, Y. Improved dimension dependence of a proximal algorithm for sampling. In The Thirty Sixth Annual Conference on Learning Theory, pp. 1473-1521. PMLR, 2023.

Ghadimi, S. and Lan, G. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469-1492, 2012.

Ghadimi, S. and Lan, G. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341-2368, 2013.

Huang, X., Dong, H., Hao, Y., Ma, Y., and Zhang, T. Monte Carlo sampling without isoperimetry: A reverse diffusion approach, 2023.

Huang, X., Zou, D., Dong, H., Ma, Y., and Zhang, T. Faster sampling without isoperimetry via diffusion-based Monte Carlo, 2024.

Lee, Y. T., Song, Z., and Vempala, S. S. Algorithmic theory of ODEs and sampling from well-conditioned logconcave densities. arXiv preprint arXiv:1812.06243, 2018.

Lee, Y. T., Shen, R., and Tian, K. Structured logconcave sampling with a restricted Gaussian oracle. In Conference on Learning Theory, pp. 2993-3050. PMLR, 2021.

Lemaréchal, C. Nonsmooth optimization and descent methods. 1978.

Lemaréchal, C. An extension of Davidon methods to nondifferentiable problems. In Nondifferentiable Optimization, pp. 95-109. Springer, 2009.

Liang, J. and Chen, Y. A proximal algorithm for sampling from non-smooth potentials. In 2022 Winter Simulation Conference (WSC), pp. 3229-3240. IEEE, 2022.

Liang, J. and Monteiro, R. D. A proximal bundle variant with optimal iteration-complexity for a large range of prox stepsizes. SIAM Journal on Optimization, 31(4):2955-2986, 2021.

Liang, J. and Monteiro, R. D. A unified analysis of a class of proximal bundle methods for solving hybrid convex composite optimization problems. Mathematics of Operations Research, 2023.

Ma, Y.-A., Chatterji, N. S., Cheng, X., Flammarion, N., Bartlett, P. L., and Jordan, M. I. Is there an analog of Nesterov acceleration for gradient-based MCMC? Bernoulli, 27(3), 2021.

Mangoubi, O. and Smith, A. Rapid mixing of Hamiltonian Monte Carlo on strongly log-concave distributions. arXiv preprint arXiv:1708.07114, 2017.

Mangoubi, O. and Vishnoi, N. Dimensionally tight bounds for second-order Hamiltonian Monte Carlo. Advances in Neural Information Processing Systems, 31, 2018.

Mifflin, R. A modification and an extension of Lemaréchal's algorithm for nonsmooth minimization. Springer, 1982.

Mou, W., Ma, Y.-A., Wainwright, M. J., Bartlett, P. L., and Jordan, M. I.
High-order Langevin diffusion yields an accelerated MCMC algorithm. The Journal of Machine Learning Research, 22(1):1919-1959, 2021.

Neal, R. Bayesian learning via stochastic dynamics. Advances in Neural Information Processing Systems, 5, 1992.

Neal, R. M. Probabilistic inference using Markov chain Monte Carlo methods. 1993.

Neal, R. M. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 54:113-162, 2010.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.

Nesterov, Y. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.

Raginsky, M., Rakhlin, A., and Telgarsky, M. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory, pp. 1674-1703. PMLR, 2017.

Robert, C. P., Casella, G., and Casella, G. Monte Carlo statistical methods, volume 2. Springer, 1999.

Roberts, G. O. and Stramer, O. Langevin diffusions and Metropolis-Hastings algorithms. Methodology and Computing in Applied Probability, 4:337-357, 2002.

Roberts, G. O. and Tweedie, R. L. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pp. 341-363, 1996.

Rockafellar, R. T. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877-898, 1976.

Vempala, S. and Wibisono, A. Rapid convergence of the unadjusted Langevin algorithm: Isoperimetry suffices. Advances in Neural Information Processing Systems, 32, 2019.

Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681-688, 2011.

Wolfe, P. A method of conjugate subgradients for minimizing nondifferentiable functions. In Nondifferentiable Optimization, pp. 145-173. Springer, 2009.

Wu, K., Schmidler, S., and Chen, Y. Minimax mixing time of the Metropolis-adjusted Langevin algorithm for log-concave sampling. The Journal of Machine Learning Research, 23(1):12348-12410, 2022.

Xifara, T., Sherlock, C., Livingstone, S., Byrne, S., and Girolami, M. Langevin diffusions and the Metropolis-adjusted Langevin algorithm. Stat. Probabil. Lett., 91:14-19, 2014.

Xu, P., Chen, J., Zou, D., and Gu, Q. Global convergence of Langevin dynamics based algorithms for nonconvex optimization. In Advances in Neural Information Processing Systems, pp. 3126-3137, 2018.

Zou, D. and Gu, Q. On the convergence of Hamiltonian Monte Carlo with stochastic gradients. In International Conference on Machine Learning, pp. 13012-13022. PMLR, 2021.

Zou, D., Xu, P., and Gu, Q. Sampling from non-log-concave distributions via variance-reduced gradient Langevin dynamics. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2936-2945. PMLR, 2019.

Zou, D., Xu, P., and Gu, Q. Faster convergence of stochastic gradient Langevin dynamics for non-log-concave sampling. In Uncertainty in Artificial Intelligence, pp. 1152-1162. PMLR, 2021.

A. Additional Notations and Assumptions in Appendix

For the convenience of analysis, we define three Markov processes, $\{x_k\}$, $\{\bar{x}_k\}$, and $\{\hat{x}_k\}$, as follows. For the process $\{x_k\}$, we suppose its initialization $x_0$ is drawn from the standard Gaussian on $\mathbb{R}^d$.
There are two transition kernels in this process. The first gives the conditional probability of $x_{k+1/2}$ given $x_k$ and is the same as Eq. 5, i.e.,
$$p_{k+1/2|k}(x'|x) \propto \exp\left(-\|x' - x\|^2/(2\eta_k)\right).$$
The second transition kernel gives the conditional probability of $x_{k+1}$ given $x_{k+1/2}$ and a stochastic mini-batch $b$, and is the same as Eq. 6, i.e.,
$$p_{k+1|k+1/2,b}(x'|x, b) \propto \exp\left(-f_b(x') - \|x' - x\|^2/(2\eta_k)\right).$$
For the process $\{\bar{x}_k\}$, we suppose the initialization $\bar{x}_0$ shares the same distribution as $x_0$, and the transition kernels are defined as
$$\bar{p}_{k+1/2|k} := p_{k+1/2|k} \quad \text{and} \quad \bar{p}_{k+1|k+1/2,b}(x'|x, b) = p_{k+1|k+1/2,b}(x'|x, \{1, 2, \ldots, n\}). \quad (9)$$
The third process $\{\hat{x}_k\}$ is the actual Markov process obtained by implementing Alg. 1. That is to say, the initialization $\hat{x}_0$ shares the same distribution as $x_0$, and the transition kernels satisfy
$$\hat{p}_{k+1/2|k} := p_{k+1/2|k} \quad \text{and} \quad \mathrm{KL}\left(\hat{p}_{k+1|k+1/2,b}(\cdot|x, b)\,\big\|\,p_{k+1|k+1/2,b}(\cdot|x, b)\right) \le \delta_k.$$
It should be noted that the transition kernel $\hat{p}_{k+1|k+1/2,b}(\cdot|x, b)$ does not have an explicit form; instead, it depends on the sampling process at Line 5 of Alg. 1. Although no explicit form is required, it should still be a good approximation of $p_{k+1|k+1/2,b}(x'|x, b)$. Finally, to simplify notation, we denote by $\varphi_{\sigma^2}$ the density function of the Gaussian distribution $N(0, \sigma^2 I)$.

Assumption [A2] implies a bounded second moment:

Lemma A.1. Assume the density $p^*$ satisfies Assumption [A2], i.e., for any smooth function $g(x)$ satisfying $\mathbb{E}_{p^*}[g^2] < \infty$:
$$\mathbb{E}_{p^*}\left[g^2\log g^2\right] - \mathbb{E}_{p^*}\left[g^2\right]\log\mathbb{E}_{p^*}\left[g^2\right] \le \frac{2}{\alpha^*}\,\mathbb{E}_{p^*}\left[\|\nabla g\|^2\right].$$
Then the density $p^*$ has the following variance bound:
$$\mathbb{E}_{x\sim p^*}\left[\|x - \mathbb{E}[x]\|^2\right] \le 2d/\alpha^*.$$

Proof. Consider a target distribution $p^*$ satisfying [A2] and, for simplicity of notation, denote the constant $C = 1/(2\alpha^*)$. We then follow the Herbst argument and take the test function in [A2] to be $g(x) = e^{tf(x)/2}$, for an arbitrary $t > 0$ and a function $f$ with $\|\nabla f(x)\| \le 1$. We obtain from the substitution that
$$\mathbb{E}_{p^*}\left[tf(x)e^{tf(x)}\right] - \mathbb{E}_{p^*}\left[e^{tf(x)}\right]\log\mathbb{E}_{p^*}\left[e^{tf(x)}\right] \le C\,\mathbb{E}_{p^*}\left[t^2 e^{tf(x)}\|\nabla f(x)\|^2\right] \le C\,\mathbb{E}_{p^*}\left[t^2 e^{tf(x)}\right].$$
Denote $F(t) = \mathbb{E}_{p^*}\left[e^{tf(x)}\right]$. We rewrite the above inequality as a differential inequality:
$$tF'(t) - F(t)\log F(t) \le Ct^2F(t), \quad \text{or equivalently:} \quad \frac{d}{dt}\left[\frac{1}{t}\log F(t)\right] \le C.$$
Taking $t \to 0$, we know that the initial condition is $\lim_{t\to 0}\frac{1}{t}\log F(t) = \mathbb{E}_{p^*}[f(x)]$. Therefore, along the entire trajectory,
$$\frac{1}{t}\log F(t) \le \mathbb{E}_{p^*}[f(x)] + Ct.$$
Plugging in the definition of $F(t)$, that is,
$$\mathbb{E}_{p^*}\left[e^{tf(x)}\right] \le \exp\left(t\,\mathbb{E}_{p^*}[f(x)] + Ct^2\right).$$
By Markov's inequality, we obtain that for $x\sim p^*$ and for any $t > 0$:
$$\mathbb{P}\left(f(x) - \mathbb{E}[f(x)] > \lambda\right) \le \exp\left(Ct^2 - \lambda t\right).$$
Optimizing over $t$ gives
$$\mathbb{P}\left(f(x) - \mathbb{E}[f(x)] > \lambda\right) \le \exp\left(-\frac{\lambda^2}{4C}\right).$$
Taking $f(x) = \langle x, \theta\rangle$, for any $\|\theta\| = 1$, gives the standard sub-Gaussian tail bound:
$$\mathbb{P}\left(\left|\langle x - \mathbb{E}[x], \theta\rangle\right| > \lambda\right) \le 2\exp\left(-\frac{\lambda^2}{4C}\right),$$
which means that the random vector $x\sim p^*$ is $\sqrt{2C}$-sub-Gaussian. This also implies that $x\sim p^*$ is $\sqrt{2Cd}$-norm-sub-Gaussian, leading to the following moment bound:
$$\left(\mathbb{E}\left[\|x - \mathbb{E}[x]\|^p\right]\right)^{1/p} \lesssim \sqrt{Cdp}.$$
We read off the second moment bound from the above inequality:
$$\mathbb{E}_{x\sim p^*}\left[\|x - \mathbb{E}[x]\|^2\right] \le 4Cd = 2d/\alpha^*.$$

Algorithm 4 Inner underdamped Langevin algorithm: InnerULD($x_0$, $b$, $\eta$, $\delta$)
1: Input: the output particle $x_0$ of Alg. 1 Line 3, the selected mini-batch $b$, the outer-loop step size $\eta$, the required inner-loop accuracy $\delta$;
2: Initialize the particle with $z_0 \leftarrow x_0$ and sample the velocity $v_0$ from $N(0, I)$;
3: for $s = 0$ to $S - 1$ do
4:   Draw the sample $(z_{s+1}, v_{s+1})$ from the Gaussian distribution $N\left(\tilde{g}(z_s, v_s), \Sigma\right)$;
5: end for
6: Return: $z_S$.

Implementation of InnerULD: Specifically, the closed form of the ULD update in Line 4 of Alg.
4 is given by $\tilde{g}: \mathbb{R}^d\times\mathbb{R}^d \to \mathbb{R}^d\times\mathbb{R}^d$, defined as
$$\tilde{g}(z, v) := \left(z + \gamma^{-1}(1-a)v - \gamma^{-1}\left(\tau - \gamma^{-1}(1-a)\right)\nabla g(z),\ \ av - \gamma^{-1}(1-a)\nabla g(z)\right),$$
where $a := \exp(-\gamma\tau)$, and
$$\Sigma = \begin{pmatrix} \frac{2}{\gamma}\left(\tau - \frac{2}{\gamma}(1-a) + \frac{1}{2\gamma}(1-a^2)\right)I_d & \frac{1}{\gamma}\left(1 - 2a + a^2\right)I_d \\ \frac{1}{\gamma}\left(1 - 2a + a^2\right)I_d & \left(1 - a^2\right)I_d \end{pmatrix}.$$
Such an iteration corresponds to the discretization of the following SDE:
$$dz_t = v_t\,dt, \qquad dv_t = -\nabla g(z_t; x_0, b, \eta)\,dt - \gamma v_t\,dt + \sqrt{2\gamma}\,dB_t,$$
where $B_t$ is a standard $d$-dimensional Brownian motion. This update is introduced in several references, including Cheng et al. (2018); Altschuler & Chewi (2023).

B. Lemmas for SPS Framework

Lemma B.1 (variant of the data-processing inequality). Consider four random variables $x, z, \tilde{x}, \tilde{z}$ whose underlying distributions are denoted $p_x, p_z, q_x, q_z$. Suppose $p_{x,z}$ and $q_{x,z}$ denote the joint densities of $(x, z)$ and $(\tilde{x}, \tilde{z})$, which we write in terms of conditionals and marginals as
$$p_{x,z}(x, z) = p_{x|z}(x|z)\,p_z(z) = p_{z|x}(z|x)\,p_x(x), \qquad q_{x,z}(x, z) = q_{x|z}(x|z)\,q_z(z) = q_{z|x}(z|x)\,q_x(x).$$
Then we have
$$\mathrm{KL}\left(p_{x,z}\,\|\,q_{x,z}\right) = \mathrm{KL}\left(p_z\,\|\,q_z\right) + \mathbb{E}_{z\sim p_z}\left[\mathrm{KL}\left(p_{x|z}(\cdot|z)\,\|\,q_{x|z}(\cdot|z)\right)\right] = \mathrm{KL}\left(p_x\,\|\,q_x\right) + \mathbb{E}_{x\sim p_x}\left[\mathrm{KL}\left(p_{z|x}(\cdot|x)\,\|\,q_{z|x}(\cdot|x)\right)\right],$$
where the latter equation implies $\mathrm{KL}(p_x\,\|\,q_x) \le \mathrm{KL}(p_{x,z}\,\|\,q_{x,z})$.

Proof. According to the formulation of the KL divergence, we have
$$\begin{aligned}\mathrm{KL}\left(p_{x,z}\,\|\,q_{x,z}\right) &= \int p_{x,z}(x,z)\log\frac{p_{x,z}(x,z)}{q_{x,z}(x,z)}\,d(x,z) = \int p_{x,z}(x,z)\left[\log\frac{p_x(x)}{q_x(x)} + \log\frac{p_{z|x}(z|x)}{q_{z|x}(z|x)}\right]d(x,z)\\ &= \int p_{x,z}(x,z)\log\frac{p_x(x)}{q_x(x)}\,d(x,z) + \int p_x(x)\int p_{z|x}(z|x)\log\frac{p_{z|x}(z|x)}{q_{z|x}(z|x)}\,dz\,dx\\ &= \mathrm{KL}\left(p_x\,\|\,q_x\right) + \mathbb{E}_{x\sim p_x}\left[\mathrm{KL}\left(p_{z|x}(\cdot|x)\,\|\,q_{z|x}(\cdot|x)\right)\right] \ge \mathrm{KL}\left(p_x\,\|\,q_x\right),\end{aligned}$$
where the last inequality follows from the fact that $\mathrm{KL}(p_{z|x}(\cdot|x)\,\|\,q_{z|x}(\cdot|x)) \ge 0$ for all $x$. With a similar technique, it can be obtained that
$$\mathrm{KL}\left(p_{x,z}\,\|\,q_{x,z}\right) = \mathrm{KL}\left(p_z\,\|\,q_z\right) + \mathbb{E}_{z\sim p_z}\left[\mathrm{KL}\left(p_{x|z}(\cdot|z)\,\|\,q_{x|z}(\cdot|z)\right)\right].$$
Hence, the proof is completed.

Lemma B.2 (strong log-concavity and smoothness of inner target functions). Using the notation of Section A, for any $k \in \{0, 1, \ldots, K-1\}$, $x\in\mathbb{R}^d$, and $b \subseteq \{1, 2, \ldots, n\}$, suppose $\eta_k < 1/L$. Then the target distributions of the inner loops, i.e., $p_{k+1|k+1/2,b}(\cdot|x, b)$, satisfy
$$\left(-L + \eta_k^{-1}\right)I \preceq -\nabla^2_{x'}\log p_{k+1|k+1/2,b}(x'|x, b) \preceq \left(L + \eta_k^{-1}\right)I.$$
Proof. For any $k \in \{0, 1, \ldots, K-1\}$, $x\in\mathbb{R}^d$, and $b \subseteq \{1, 2, \ldots, n\}$, we have
$$p_{k+1|k+1/2,b}(x'|x, b) = C(b, \eta_k, x)^{-1}\exp\left(-f_b(x') - \frac{\|x' - x\|^2}{2\eta_k}\right),$$
which implies
$$-\nabla^2_{x'}\log p_{k+1|k+1/2,b}(x'|x, b) = \nabla^2 f_b(x') + \eta_k^{-1}I.$$
Since we have [A1], it follows that $\left(-L + \eta_k^{-1}\right)I \preceq \nabla^2 f_b(x') + \eta_k^{-1}I \preceq \left(L + \eta_k^{-1}\right)I$. Hence, the proof is completed.

Lemma B.3. Using the notation of Section A, for any $k \in \{0, 1, \ldots, K-1\}$, $x\in\mathbb{R}^d$, and $b \subseteq \{1, 2, \ldots, n\}$, suppose $\eta < 1/L$. Then we have
$$\mathrm{KL}\left(p_{k+1|k+1/2,b}(\cdot|x, b)\,\big\|\,\bar{p}_{k+1|k+1/2,b}(\cdot|x, b)\right) \le \frac{1}{2(\eta^{-1} - L)}\,\mathbb{E}_{x'\sim p_{k+1|k+1/2,b}(\cdot|x,b)}\left[\|\nabla f(x') - \nabla f_b(x')\|^2\right].$$
Proof. We abbreviate $p_{k+1|k+1/2,b}(\cdot|x, b)$ and $\bar{p}_{k+1|k+1/2,b}(\cdot|x, b)$ as $p$ and $\bar{p}$ for convenience. According to the definition of $p$, i.e., Eq. 6, and $\bar{p}$, i.e., Eq. 9, we have
$$p(x') = C(b, \eta, x)^{-1}\exp\left(-f_b(x') - \frac{\|x' - x\|^2}{2\eta}\right), \qquad \bar{p}(x') = C(\eta, x)^{-1}\exp\left(-f(x') - \frac{\|x' - x\|^2}{2\eta}\right).$$
According to Lemma B.2 (applied with the full batch), we have $-\nabla^2\log\bar{p}(x') \succeq \left(-L + \eta^{-1}\right)I$, which means the density $\bar{p}$ is strongly log-concave when $\eta < 1/L$. According to Lemma E.2, the density $\bar{p}$ satisfies LSI with constant $(\eta^{-1} - L)$. Then, by the definition of LSI, we have
$$\mathrm{KL}\left(p\,\|\,\bar{p}\right) \le \frac{1}{2(\eta^{-1} - L)}\,\mathbb{E}_{x'\sim p}\left[\left\|\nabla\log\frac{p(x')}{\bar{p}(x')}\right\|^2\right] = \frac{1}{2(\eta^{-1} - L)}\,\mathbb{E}_{x'\sim p}\left[\|\nabla f(x') - \nabla f_b(x')\|^2\right].$$
Hence, the proof is completed.

Lemma B.4. Using the notation of Section A and considering Alg. 1, if $\eta_i \le 1/(2L)$ for all
$i \in \{0, 1, \ldots, K-1\}$, then we have
$$\mathrm{TV}\left(p_K, \bar{p}_K\right) \le \sigma\sqrt{\sum_{i=0}^{K-1}\frac{\eta_i}{2|b_i|}},$$
where $|\cdot|$ denotes the sample size of each mini-batch loss.

Proof. According to Pinsker's inequality, we have
$$\mathrm{TV}\left(p_K, \bar{p}_K\right) \le \sqrt{\frac{1}{2}\mathrm{KL}\left(p_K\,\|\,\bar{p}_K\right)}.$$
Let $p_{k+1,k+1/2,b}$ and $\bar{p}_{k+1,k+1/2,b}$ denote the joint densities of $(x_{k+1}, x_{k+1/2}, b_k)$ and $(\bar{x}_{k+1}, \bar{x}_{k+1/2}, \bar{b}_k)$, respectively, which we write in terms of conditionals and marginals as
$$p_{k+1,k+1/2,b}(x', x, b) = p_{k+1|k+1/2,b}(x'|x, b)\,p_{k+1/2,b}(x, b) = p_{k+1/2,b|k+1}(x, b|x')\,p_{k+1}(x'),$$
$$\bar{p}_{k+1,k+1/2,b}(x', x, b) = \bar{p}_{k+1|k+1/2,b}(x'|x, b)\,\bar{p}_{k+1/2,b}(x, b) = \bar{p}_{k+1/2,b|k+1}(x, b|x')\,\bar{p}_{k+1}(x').$$
In this condition, we have
$$\begin{aligned}\mathrm{KL}\left(p_{k+1}\,\|\,\bar{p}_{k+1}\right) &\le \mathrm{KL}\left(p_{k+1,k+1/2,b}\,\|\,\bar{p}_{k+1,k+1/2,b}\right)\\ &= \mathrm{KL}\left(p_{k+1/2,b}\,\|\,\bar{p}_{k+1/2,b}\right) + \mathbb{E}_{(\bar{x},\bar{b})\sim p_{k+1/2,b}}\left[\mathrm{KL}\left(p_{k+1|k+1/2,b}(\cdot|\bar{x},\bar{b})\,\|\,\bar{p}_{k+1|k+1/2,b}(\cdot|\bar{x},\bar{b})\right)\right]\\ &= \mathrm{KL}\left(p_{k+1/2}\,\|\,\bar{p}_{k+1/2}\right) + \mathbb{E}_{\bar{x}\sim p_{k+1/2}}\left[\mathrm{KL}\left(p_{b|k+1/2}(\cdot|\bar{x})\,\|\,\bar{p}_{b|k+1/2}(\cdot|\bar{x})\right)\right]\\ &\quad + \mathbb{E}_{(\bar{x},\bar{b})\sim p_{k+1/2,b}}\left[\mathrm{KL}\left(p_{k+1|k+1/2,b}(\cdot|\bar{x},\bar{b})\,\|\,\bar{p}_{k+1|k+1/2,b}(\cdot|\bar{x},\bar{b})\right)\right],\end{aligned}$$
which follows from Lemma B.1: for the first and second equations, we respectively plug $x := x_{k+1}$, $z := (x_{k+1/2}, b_k)$, $\tilde{x} := \bar{x}_{k+1}$, $\tilde{z} := (\bar{x}_{k+1/2}, \bar{b}_k)$, and $x := x_{k+1/2}$, $z := b_k$, $\tilde{x} := \bar{x}_{k+1/2}$, $\tilde{z} := \bar{b}_k$ into Lemma B.1. Here, we note that $\bar{b}_k$ is introduced as an auxiliary random variable, which is independent of the update of $\bar{x}_k$ for all $k \in \{0, 1, \ldots, K-1\}$. Then, by requiring
$$p_{b|k+1/2}(\cdot|x) = \bar{p}_{b|k+1/2}(\cdot|x) = p_b \quad \forall x\in\mathbb{R}^d \quad \text{and} \quad \eta_k \le 1/(2L), \quad (10)$$
we obtain
$$\begin{aligned}\mathrm{KL}\left(p_{k+1}\,\|\,\bar{p}_{k+1}\right) &\le \mathrm{KL}\left(p_{k+1/2}\,\|\,\bar{p}_{k+1/2}\right) + \frac{1}{2(\eta_k^{-1} - L)}\,\mathbb{E}_{(\bar{x},\bar{b})\sim p_{k+1/2,b}}\,\mathbb{E}_{x'\sim p_{k+1|k+1/2,b}(\cdot|\bar{x},\bar{b})}\left[\|\nabla f(x') - \nabla f_{\bar{b}}(x')\|^2\right]\\ &\le \mathrm{KL}\left(p_{k+1/2}\,\|\,\bar{p}_{k+1/2}\right) + \eta_k\,\mathbb{E}_{(\bar{x},\bar{b})\sim p_{k+1/2,b}}\,\mathbb{E}_{x'\sim p_{k+1|k+1/2,b}(\cdot|\bar{x},\bar{b})}\left[\|\nabla f(x') - \nabla f_{\bar{b}}(x')\|^2\right], \quad (11)\end{aligned}$$
where the first inequality follows from Lemma B.3 and the last inequality follows from the choice of the step size $\eta_k$. Then, we upper bound the second term on the RHS of Eq. 11:
$$\mathbb{E}_{(\bar{x},\bar{b})\sim p_{k+1/2,b}}\,\mathbb{E}_{x'\sim p_{k+1|k+1/2,b}(\cdot|\bar{x},\bar{b})}\left[\|\nabla f(x') - \nabla f_{\bar{b}}(x')\|^2\right] = \int p_{k+1,k+1/2,b}(x', x, b)\,\|\nabla f(x') - \nabla f_b(x')\|^2\,d(x', x, b). \quad (12)$$
The density $p_{k+1,k+1/2,b}(x', x, b)$ of the joint distribution satisfies
$$p_{k+1,k+1/2,b}(x', x, b) = p_{k+1|k+1/2,b}(x'|x, b)\,p_{k+1/2,b}(x, b) = p_{k+1|k+1/2,b}(x'|x, b)\,p_{k+1/2}(x)\,p_b(b), \quad (13)$$
where the last equality follows from Eq. 10 and the fact that the choice of $b_k$ is independent of the choice of $x_{k+1/2}$, as shown in Line 4 of Alg. 1. Combining Eq. 12 and Eq. 13, we have
$$\mathbb{E}_{(\bar{x},\bar{b})\sim p_{k+1/2,b}}\,\mathbb{E}_{x'\sim p_{k+1|k+1/2,b}(\cdot|\bar{x},\bar{b})}\left[\|\nabla f(x') - \nabla f_{\bar{b}}(x')\|^2\right] = \sum_{b\subseteq\{1,2,\ldots,n\}}\int p_{k+1|b}(x')\,p_b(b)\,\|\nabla f(x') - \nabla f_b(x')\|^2\,dx' = \int p_{k+1}(x')\,\mathbb{E}_{b_k}\left[\|\nabla f(x') - \nabla f_{b_k}(x')\|^2\right]dx' \le \frac{\sigma^2}{|b_k|}, \quad (14)$$
where the last inequality follows from [A3] and Lemma E.1. Hence, Eq. 11 becomes
$$\mathrm{KL}\left(p_{k+1}\,\|\,\bar{p}_{k+1}\right) \le \mathrm{KL}\left(p_{k+1/2}\,\|\,\bar{p}_{k+1/2}\right) + \frac{\eta_k\sigma^2}{|b_k|}. \quad (15)$$
Then, considering the first-stage update, we have
$$\mathrm{KL}\left(p_{k+1/2}\,\|\,\bar{p}_{k+1/2}\right) \le \mathrm{KL}\left(p_k\,\|\,\bar{p}_k\right) + \mathbb{E}_{\bar{x}\sim p_k}\left[\mathrm{KL}\left(p_{k+1/2|k}(\cdot|\bar{x})\,\|\,\bar{p}_{k+1/2|k}(\cdot|\bar{x})\right)\right] = \mathrm{KL}\left(p_k\,\|\,\bar{p}_k\right), \quad (16)$$
where the first inequality follows from Lemma B.1 with $x := x_{k+1/2}$, $z := x_k$, $\tilde{x} := \bar{x}_{k+1/2}$, $\tilde{z} := \bar{x}_k$, and the second equation holds since $\{x_k\}$ and $\{\bar{x}_k\}$ share the same first-stage update, as shown in Eq. 5 and Eq. 9. Combining Eq. 15 and Eq. 16, we have
$$\mathrm{KL}\left(p_{k+1}\,\|\,\bar{p}_{k+1}\right) \le \mathrm{KL}\left(p_k\,\|\,\bar{p}_k\right) + \frac{\eta_k\sigma^2}{|b_k|},$$
which implies $\mathrm{KL}\left(p_K\,\|\,\bar{p}_K\right) \le \sigma^2\sum_{i=0}^{K-1}\eta_i/|b_i|$ by the telescoping sum. Hence, the proof is completed.

Lemma B.5. Using the notation of Section A, we have
$$\mathrm{TV}\left(\hat{p}_K, p_K\right) \le \sqrt{\frac{1}{2}\sum_{i=0}^{K-1}\delta_i},$$
where $\delta_i$ denotes the error tolerance of the approximate conditional densities shown in Eq. 7.

Proof. According to Pinsker's inequality, we have
$$\mathrm{TV}\left(\hat{p}_{k+1}, p_{k+1}\right) \le \sqrt{\frac{1}{2}\mathrm{KL}\left(\hat{p}_{k+1}\,\|\,p_{k+1}\right)}.$$
Let $p_{k+1,k+1/2,b}$ and $\hat{p}_{k+1,k+1/2,b}$ denote the joint densities of $(x_{k+1}, x_{k+1/2}, b_k)$ and $(\hat{x}_{k+1}, \hat{x}_{k+1/2}, \hat{b}_k)$, respectively, which we write in terms of conditionals and marginals as
$$p_{k+1,k+1/2,b}(x', x, b) = p_{k+1|k+1/2,b}(x'|x, b)\,p_{k+1/2,b}(x, b) = p_{k+1/2,b|k+1}(x, b|x')\,p_{k+1}(x'),$$
$$\hat{p}_{k+1,k+1/2,b}(x', x, b) = \hat{p}_{k+1|k+1/2,b}(x'|x, b)\,\hat{p}_{k+1/2,b}(x, b) = \hat{p}_{k+1/2,b|k+1}(x, b|x')\,\hat{p}_{k+1}(x').$$
In this condition, we have
$$\begin{aligned}\mathrm{KL}\left(\hat{p}_{k+1}\,\|\,p_{k+1}\right) &\le \mathrm{KL}\left(\hat{p}_{k+1,k+1/2,b}\,\|\,p_{k+1,k+1/2,b}\right)\\ &= \mathrm{KL}\left(\hat{p}_{k+1/2,b}\,\|\,p_{k+1/2,b}\right) + \mathbb{E}_{(\hat{x},\hat{b})\sim\hat{p}_{k+1/2,b}}\left[\mathrm{KL}\left(\hat{p}_{k+1|k+1/2,b}(\cdot|\hat{x},\hat{b})\,\|\,p_{k+1|k+1/2,b}(\cdot|\hat{x},\hat{b})\right)\right]\\ &= \mathrm{KL}\left(\hat{p}_{k+1/2}\,\|\,p_{k+1/2}\right) + \mathbb{E}_{\hat{x}\sim\hat{p}_{k+1/2}}\left[\mathrm{KL}\left(\hat{p}_{b|k+1/2}(\cdot|\hat{x})\,\|\,p_{b|k+1/2}(\cdot|\hat{x})\right)\right]\\ &\quad + \mathbb{E}_{(\hat{x},\hat{b})\sim\hat{p}_{k+1/2,b}}\left[\mathrm{KL}\left(\hat{p}_{k+1|k+1/2,b}(\cdot|\hat{x},\hat{b})\,\|\,p_{k+1|k+1/2,b}(\cdot|\hat{x},\hat{b})\right)\right],\end{aligned}$$
where the first and second equations are established by plugging $x := \hat{x}_{k+1}$, $z := (\hat{x}_{k+1/2}, \hat{b}_k)$, $\tilde{x} := x_{k+1}$, $\tilde{z} := (x_{k+1/2}, b_k)$, and $x := \hat{x}_{k+1/2}$, $z := \hat{b}_k$, $\tilde{x} := x_{k+1/2}$, $\tilde{z} := b_k$ into Lemma B.1, respectively. Then, by requiring
$$\hat{p}_{b|k+1/2}(\cdot|x) = p_{b|k+1/2}(\cdot|x) = p_b \quad \forall x\in\mathbb{R}^d, \quad (17)$$
we obtain
$$\mathrm{KL}\left(\hat{p}_{k+1}\,\|\,p_{k+1}\right) \le \mathrm{KL}\left(\hat{p}_{k+1/2}\,\|\,p_{k+1/2}\right) + \mathbb{E}_{(\hat{x},\hat{b})\sim\hat{p}_{k+1/2,b}}\left[\mathrm{KL}\left(\hat{p}_{k+1|k+1/2,b}(\cdot|\hat{x},\hat{b})\,\|\,p_{k+1|k+1/2,b}(\cdot|\hat{x},\hat{b})\right)\right] \le \mathrm{KL}\left(\hat{p}_{k+1/2}\,\|\,p_{k+1/2}\right) + \delta_k, \quad (18)$$
where the last inequality follows from Eq. 7. Besides, considering the first-stage update, we have
$$\mathrm{KL}\left(\hat{p}_{k+1/2}\,\|\,p_{k+1/2}\right) \le \mathrm{KL}\left(\hat{p}_k\,\|\,p_k\right) + \mathbb{E}_{\hat{x}\sim\hat{p}_k}\left[\mathrm{KL}\left(\hat{p}_{k+1/2|k}(\cdot|\hat{x})\,\|\,p_{k+1/2|k}(\cdot|\hat{x})\right)\right] = \mathrm{KL}\left(\hat{p}_k\,\|\,p_k\right), \quad (19)$$
where the first inequality follows from Lemma B.1 with $x := \hat{x}_{k+1/2}$, $z := \hat{x}_k$, $\tilde{x} := x_{k+1/2}$, $\tilde{z} := x_k$, and the second equation holds since $\{x_k\}$ and $\{\hat{x}_k\}$ share the same first-stage update, as shown in Eq. 5 and Eq. 7. Combining Eq. 18 and Eq. 19, we have
$$\mathrm{KL}\left(\hat{p}_{k+1}\,\|\,p_{k+1}\right) \le \mathrm{KL}\left(\hat{p}_k\,\|\,p_k\right) + \delta_k,$$
which implies $\mathrm{KL}(\hat{p}_K\,\|\,p_K) \le \sum_{i=0}^{K-1}\delta_i$ by the telescoping sum. Hence, the proof is completed.

Lemma B.6. Suppose Assumptions [A1]-[A3] hold, and Alg. 1 satisfies: (i) the step sizes obey $\eta_i \le 1/(2L)$ for all $i \in \{0, 1, \ldots, K-1\}$; (ii) the initial particle $\hat{x}_0$ is drawn from the standard Gaussian distribution on $\mathbb{R}^d$; (iii) the transition kernel at Line 5 of Alg. 1, i.e., $\hat{p}_{k+1|k+1/2,b}(\cdot|x, b)$, satisfies Eq. 7 with $\delta_k = 0$. Then, we have
$$\mathrm{TV}\left(\hat{p}_K, p^*\right) \le \sigma\sqrt{\sum_{i=0}^{K-1}\frac{\eta_i}{2|b_i|}} + \sqrt{\frac{1}{2}\mathrm{KL}\left(p_0\,\|\,p^*\right)}\cdot\prod_{i=0}^{K-1}(1 + \alpha^*\eta_i)^{-1}.$$

Proof. When $\delta_k = 0$, the Markov process $\{\hat{x}_k\}$ shares the same underlying distribution as the Markov process $\{x_k\}$. We upper bound the total variation distance between $p_K$ and $p^*$ as
$$\mathrm{TV}\left(p_K, p^*\right) \le \mathrm{TV}\left(p_K, \bar{p}_K\right) + \mathrm{TV}\left(\bar{p}_K, p^*\right). \quad (20)$$
According to Lemma B.4, by requiring $\eta_i \le 1/(2L)$ for all $i \in \{0, 1, \ldots, K-1\}$, we have
$$\mathrm{TV}\left(p_K, \bar{p}_K\right) \le \sigma\sqrt{\sum_{i=0}^{K-1}\frac{\eta_i}{2|b_i|}}. \quad (21)$$
Besides, for $\mathrm{TV}(\bar{p}_K, p^*)$ in Eq. 20, we have
$$\mathrm{TV}\left(\bar{p}_K, p^*\right) \le \sqrt{\frac{1}{2}\mathrm{KL}\left(\bar{p}_K\,\|\,p^*\right)} \le \sqrt{\frac{1}{2}\mathrm{KL}\left(p_0\,\|\,p^*\right)}\cdot\prod_{i=0}^{K-1}(1 + \alpha^*\eta_i)^{-1}, \quad (22)$$
where the first inequality follows from Pinsker's inequality, the second from Lemma E.3, and the resulting bound can be made explicit via Lemma E.4 when we set $p_0$ as the standard Gaussian on $\mathbb{R}^d$. Finally, plugging Eq. 21 and Eq. 22 into Eq. 20 completes the proof.

Proof of Theorem 3.1. Using the notation of Section A, we upper bound the total variation distance between $\hat{p}_K$ and $p^*$ as
$$\mathrm{TV}\left(\hat{p}_K, p^*\right) \le \mathrm{TV}\left(\hat{p}_K, p_K\right) + \mathrm{TV}\left(p_K, p^*\right).$$
According to Lemma B.5, we have
$$\mathrm{TV}\left(\hat{p}_K, p_K\right) \le \sqrt{\frac{1}{2}\sum_{i=0}^{K-1}\delta_i}. \quad (23)$$
Besides, by Lemma B.6 we have
$$\mathrm{TV}\left(p_K, p^*\right) \le \sigma\sqrt{\sum_{i=0}^{K-1}\frac{\eta_i}{2|b_i|}} + \sqrt{\frac{1}{2}\mathrm{KL}\left(p_0\,\|\,p^*\right)}\cdot\prod_{i=0}^{K-1}(1 + \alpha^*\eta_i)^{-1}. \quad (24)$$
Here, we note that the gradient complexity of Alg. 1 is dominated by Line 5, i.e., the inner sampler, which requires $\mathrm{GC}(|b_k|, \delta_k)$ gradient calls at the $k$-th iteration. Therefore, the total gradient complexity is $\sum_{i=0}^{K-1}\mathrm{GC}(|b_i|, \delta_i)$, and the proof is completed.

C. Theorems for Different Implementations

C.1. Stochastic Gradient Langevin Dynamics Inner Samplers

Lemma C.1.
Using the notations presented in Alg 2, asume [A1]-[A3], for any τs (0, 1 36], we have 2τs KL q s pk+1|k+ 1 2 ,b( |x0, b) 1 τs W 2 2 (qs, pk+1|k+ 1 2 ,b( |x0, b)) W 2 2 (qs+1, pk+1|k+ 1 2 ,b( |x0, b)) + 4τ 2 s σ2 |bs| + 6τ 2 s d η where qs, q s and q denotes underlying distribution of zs, z s and the ideal output particles. Proof. This proof only considers the KL divergence behavior for the k-th inner sampling subproblem, i.e., Line 5 of Alg 1. The target distribution of the inner loop, i.e., pk+1|k+1/2,b( |x0, b) will be abbreviated as q (z) := C 1 q exp( g(z)) = C 1 q exp fb(z) z x0 2 Since Inner SGLD sample mini-batch bs from b for all s {1, 2, . . . , S}, we define gbs(z) := 1 i bs fi(z) z x0 2 Combining Lemma B.2 and the choice of the step size, i.e., η 1/2L, we have (2η) 1 I 2g(z) = 2q (z) (3/2η) I. Suppose the underlying distribution of zs and z s are qs and q s respectively. Besides, the KL divergence between qs and q is KL qs q = Z qs(z) log qs(z) q (z)dz = Z qs(z) log qs(z)dz | {z } H(qs) + Z qs(z) (g(z) + log Cq) dz | {z } E(qs) Then we consider the dynamics of entropy H and energy E functionals with the iteration presented as 1 ξ where ξ N(0, I), zs+1 = z s τs gbs (z s) . Faster Sampling via Stochastic Gradient Proximal Sampler Energy functional dynamics We start with the following inequality W 2 2 (qs+1, q ) E(z s,z ) γ s h Ezs+1 q s+1|s( |z s) zs+1 z 2i , where γ s denotes the optimal coupling between the densities q s and q , and q s+1|s( |z s) denotes the density function for zs+1 when z s = z s. According to the change of variables, the inner expectation on the RHS satisfies Ezs+1 q s+1|s( |z s) zs+1 z 2 = X bs b pb(bs) z s τs gbs(z s) z 2 = z s z 2 2τs g(z s), z s z + τ 2 s Ebs gbs(z s) 2 z s z 2 2τs (g(z s) g(z)) + τ 2 s Ebs gbs(z s) 2 , where the last inequality follows from the strong convexity of g, i.e., g(z) g(z s) g(z s), z z s + 1 4η z z s 2 . Taking the expectation for both sides of Eq 25, we have E(z s,z) γ s h Eq s+1|s( |z s) zs+1 z 2i 1 τs W 2 2 (q s, q ) 2τs (E(q s) E(q )) + τ 2 s E(z s,z) γ s h Ebs gbs(z s) 2i . (26) Then, we start to upper bound the last term of Eq 26, and have E(z s,z) γ s h Ebs gbs(z s) 2i = E(z s,z) γ s h Ebs gbs(z s) gbs(z) + gbs(z) 2i 2E(z s,z) γ s h Ebs gbs(z s) gbs(z) 2i + 2E(z s,z) γ s h Ebs gbs(z) g(z) + g(z) 2i 2 E(z s,z) γ s z s z 2 + 4E(z s,z) γ s h Ebs gbs(z) g(z) 2i + 4Ez q h g(z) 2i . For the first term, with the definition of γ s, we have E(z s,z) γ s z s z 2 = W 2 2 (q s, q ). For the second one, suppose we sample bs uniformly from b sharing the same sampler number for all s {1, 2, . . . , S}, i.e., bin. Then, for any z Rd, we have Ebs gbs(z) g(z) 2 = Ebs h fbs(z) f(z) 2i σ2 which follows from Lemma E.1. It then implies E(z s,z) γ s h Ebs gbs(z) g(z) 2i σ2 For the last term, we have Ez q h g(z) 2i 3d which follows from Lemma E.6. In these conditions, Eq 27 can be represented as E(z s,z) γ s h Ebs gbs(z s) 2i 9 2η W 2 2 (q s, q ) + 4σ2 Faster Sampling via Stochastic Gradient Proximal Sampler Plugging this inequality into Eq 26, we have W 2 2 (qs+1, q ) 1 τs 2η + 9τ 2 s η W 2 2 (q s, q ) 2τs (E(q s) E(q )) + 4τ 2 s σ2 bin + 6τ 2 s d η , which is equivalent to 2τs (E(q s) E(q )) 1 τs 2η + 9τ 2 s η W 2 2 (q s, q ) W 2 2 (qs+1, q ) + 4τ 2 s σ2 bin + 6τ 2 s d η . By requiring 9τ 2 s η τs 2τs (E(q s) E(q )) 1 τs W 2 2 (q s, q ) W 2 2 (qs+1, q ) + 4τ 2 s σ2 bin + 6τ 2 s d η . 
(28) Entropy functional bound According to Lemma E.7, we have (H(q s) H(q )) W 2 2 (qs, q ) W 2 2 (q s, q ), which is equivalent to 2τs (H(q s) H(q )) 1 τs W 2 2 (qs, q ) 1 τs W 2 2 (q s, q ). (29) Therefore, combining Eq 28 and Eq 29, we have 2τs KL q s q 1 τs W 2 2 (qs, q ) W 2 2 (qs+1, q ) + 4τ 2 s σ2 bin + 6τ 2 s d η . (30) Hence, the proof is completed. Corollary C.2. Using the notations presented in Alg 2, asume [A1]-[A3]. Define: τs := τ min ( δ 16 2σ2η bin + 3d 1 , 1 , S log 2W 2 2 (q1, pk+1|k+ 1 2 ,b( |x0, b)) where bin denotes the uniformed minibatch size of sampled in Line 5 of Alg 2. Then, the underlying distribution of particles at S-th iteration, i.e., q S, satisfies W 2 2 (q S, pk+1|k+ 1 2 ,b( |x0, b)) δ. Proof. Similar to Lemma C.1, the target distribution of the inner loop, i.e., pk+1|k+1/2,b( |x0, b) will be abbreviated as q (z) := C 1 q exp( g(z)) = C 1 q exp fb(z) z x0 2 and we define the minibatch loss as follows gbs(z) := 1 i bs fi(z) z x0 2 Then, using Lemma C.1 and since the KL divergence is non-negative, for all s {0, 2, . . . , S 1}, we have W 2 2 (qs+1, q ) 1 τs W 2 2 (qs, q ) + 4τ 2 s σ2 |bs| + 6τ 2 s d η . Faster Sampling via Stochastic Gradient Proximal Sampler Following from a direct induction, we have W 2 2 (q S, q ) W 2 2 (q0, q ) + |bi| + 6τ 2 i d η In this condition, we choose uniformed step and mini-batch sizes, i.e., τs = τ, |bs| = bin, and have W 2 2 (q S, q ) 1 τ S W 2 2 (q0, q ) + 4σ2 i=0 τ 2 1 τ S W 2 2 (q0, q ) + 2σ2η bin + 3d 8τ. Using that for all u R+, 1 u exp( u), then it has S W 2 2 (q1, q ) exp τS W 2 2 (q0, q ) δ Without loss of generality, the iteration number of inner loop will be large, which implies the last inequality of Eq 32 will establish by requiring τS log 2W 2 2 (q1, pk+1|k+ 1 2 ,b( |x0, b)) In the following, we choose the value of τS to be the lower bound. Besides, we require the last term of Eq 31 to satisfy bin + 3d 8τ δ bin + 3d 1 . (33) Combining Eq 32 and Eq 33, the proof is completed. Lemma C.3. Using the notations presented in Alg 2, asume [A1]-[A3]. Define S log 2W 2 2 (q1, pk+1|k+ 1 2 ,b( |x0, b)) δ 4ητ 1 and S N+, for all s [0, S ], the step sizes and sample sizes satisfy |bs| = bin and τs := τ min ( δ 16 2σ2η bin + 3d 1 , 1 in Alg 2. Besides, for s [S + 1, S], the step sizes and sampler sizes are |bs| = b in and τs := τ min In this condition, if the total iteration number S satisfies S S + (τ ) 1 and S N+, then the underlying distribution q S of output particles satisfies KL q S pk+1|k+ 1 2 ,b( |x0, b) δ. Proof. We first introduce 0 < S < S satisfying S N+, and denote the underlying distribution of output particles as PS i=S +1 q i S S where i N+ Faster Sampling via Stochastic Gradient Proximal Sampler and q i denotes the underlying distribution of z i in Alg 2. Similar to Lemma C.1, the target distribution of the inner loop, i.e., pk+1|k+1/2,b( |x0, b) will be abbreviated as q ( ). Then, we set all step and sample sizes between S -th to S-th iteration are uniformed τ and b in. In this condition, we have q S q 1 S S i=S +1 KL q i q 1 2τ (S S ) W 2 2 (q S +1, q ) 4η W 2 2 (qi, q ) W 2 2 (q S+1, q ) +(S S ) 4(τ )2σ2 b in + 6(τ )2d W 2 2 (q S +1, q ) 2τ (S S ) + 2τ σ2 b in + 3τ d where the first inequality follows from Lemma E.5 and the second inequality follows from Lemma C.1. According to Corollary C.2, in Alg 2, if we set τs := τ min ( δ 16 2σ2η bin + 3d 1 , 1 , S log 2W 2 2 (q1, q ) for all s [0, S ], then we have W 2 2 (q S +1, q ) δ. 
In this condition, by requiring τ (S S ) 1, and τ δ the first and the second term of Eq 34 will satisfies W 2 2 (q S +1, q ) 2τ (S S ) δ 2, and 2τ σ2 b in + 3τ d Hence, the proof is completed. Theorem C.4 (Formal version of Theorem 4.1). Suppose [A1]-[A3] hold. With the following parameter settings α log (1 + L2)d 4α ϵ2 , δk = 2ϵ2α L log (1 + L2)d bo = min σ2 4α ϵ2 log (1 + L2)d 4α ϵ2 , n , for Alg 1, if we choose Alg 2 as the inner sampler shown in Line 5 Alg 1, set 16 σ2 + 3Ld log (1 + L2)d 4L σ2 + 3Ld log (1 + L2)d S (x0, b) = fb(0) 2 + L + L x0 2 + log log (1 + L2)d fb(0) 2 + L + L x0 2 + log log (1 + L2)d Lτ + (τ ) 1, τs = τ when s [0, S (x0, b)] τs = τ when s [S (x0, b) + 1, S(x0, b) 1] Faster Sampling via Stochastic Gradient Proximal Sampler and 1 inner minibatch size, i.e., bin = 1, then the underlying distribution of returned particles ˆp K in Alg 1 satisfies TV (ˆp K+1, p ) < 3ϵ. In this condition, the expected gradient complexity will be 34L3(σ2 + 3d) α3 ϵ2 log(24L2) log2 (1 + L2)d 4α ϵ2 log 30L2 M + σ2 + d + 1 + f(0) 2 which can be abbreviated as Θ(κ3ϵ 2 (d + σ2)). Proof. For the detailed implementation of Alg 1 with Alg 2, we consider the following settings. For all k {0, 1, . . . , K 1}, the mini-batch bk in Alg 1 Line 2 has a uniformed norm which is denoted as |bk| = bo. For all k {0, 1, . . . , K 1}, the conditional probability densities pk+1|k+1/2,b( |xk+1/2, bk) in Alg 1 Line 4 formulated as Eq 6 share the same L-2 regularized coefficients, i.e., η 1 k . For all k {0, 1, . . . , K 1}, the inner sampler shown in Alg 1 Line 5 is chosen as Alg 2. Errors control of outer loops. With these conditions, we have TV (ˆp K, p ) 4α (1 + α η) K which follows from Theorem 3.1. For achieving TV (p K+1, p ) O(ϵ), we start with choosing the step size η and the iteration number K in Alg 1. By requiring 2L and K (α η) 1 log (1 + L2)d α log (1 + L2)d 4α ϵ2 , (35) (1 + α η)2K exp(α ηK) (1 + L2)d 4α ϵ2 exp( α Kη) ϵ, where the first inequality follows from 1 + u exp(u/2) when u 1. The last equation of Eq 35 establishes when η is chosen as its upper bound. Besides by requiring bo min Kησ2 2ϵ2 , n = min σ2 α ϵ2 log (1 + L2)d 4α ϵ2 , n , (36) we have σ p Kη/(2bo) ϵ. The last equation of Eq 36 requires the choice of η and K in Eq 35 to be their upper and lower bound respectively. For simplicity, we consider inner samplers for all iterations share the same error tolerance, i.e., δk = δ for all k {1, 2, . . . , K}. By requiring, L log (1 + L2)d 1 2 PK 1 i=0 δi ϵ. The last inequality of Eq 37 holds when K is chosen as its lower bound in Eq 35. Errors control of inner loops. Then, we start to consider the hyper-parameter settings of the inner loop and the total gradient complexity. According to Theorem 3.1, we require the underlying distribution of output particles of the inner loop, i.e., ˆpk+1|k+ 1 2 ,b( |x0, b), satisfies KL ˆpk+1|k+ 1 2 ,b( |x0, b) pk+1|k+ 1 2 ,b( |x0, b) δ ϵ2α L log (1 + L2)d Faster Sampling via Stochastic Gradient Proximal Sampler for all x0 Rd and b {1, 2, . . . , n}. Then, to achieve Eq 38, Lemma C.3 will decompose the total inner iterations of Alg 2, i.e., s [0, S(x0, b)] into two stages. For the first stage, we consider τs := τ min ( δ 16 2σ2η bin + 3d 1 , 1 16 σ2 + 3Ld log (1 + L2)d for s [0, S (x0, b)] where log 2L W 2 2 (q0, pk+1|k+ 1 2 ,b( |x0, b)) α ϵ2 + log log (1 + L2)d Lτ and S (x0, b) N+. (40) It should be noted that the last equation of Eq 39 only establishes when δ and η are chosen as their upper bounds, and bin = 1. 
For the second stage, we consider τs := τ min 4L σ2 + 3Ld log (1 + L2)d for s [S (x0, b) + 1, S(x0, b) 1] where S(x0, b) S (x0, b) + (τ ) 1 log 2L W 2 2 (q0, pk+1|k+ 1 2 ,b( |x0, b)) α ϵ2 + log log (1 + L2)d 32σ2 + 96Ld Lα ϵ2 log (1 + L2)d + 4Lσ2 + 12L2d α ϵ2 log (1 + L2)d It should be noted that the last equation of Eq 42 only establishes when δ and η are chosen as their upper bounds, and b in = 1. Since the choice of S(x0, b) depend on the upper bound of W 2 2 (q1, pk+1|k+ 1 2 ,b( |x0, b)), we start to bound it. Line 3 of Alg 2 has presented that q0 is a Gaussian-type distribution with η 1-strong convexity, then we have q0 also satisfies η 1-LSI due to Lemma E.2, which implies W 2 2 (q0, pk+1|k+ 1 2 ,b( |x0, b)) 2ηKL q0 pk+1|k+ 1 2 ,b( |x0, b) η2FI q0 pk+1|k+ 1 2 ,b( |x0, b) . Noted that the relative Fisher information satisfies FI q0 pk+1|k+ 1 2 ,b( |x0, b) = Z q0(z) log q0(z) pk+1|k+ 1 2 ,b(z|x0, b) = Z q0(z) fb(z) fb(0) + fb(0) f(0) + f(0) 2 dz 3L2Ez q0[ z 2] + 3 fb(0) f(0) 2 + 3 f(0) 2 = 3L2(η + x0 2) + 3 fb(0) f(0) 2 + 3 f(0) 2 . where the first inequality follows from [A1] with respect to fb, and the last equation follows from the explicit form of the mean and variance of Gaussian-type q0. Taking the expectation for both sides, we have Ex0,b h W 2 2 (q0, pk+1|k+ 1 2 ,b( |x0, b)) i 3η2 L2η + L2Ex0 h x0 2i + Eb h fb(0) f(0) 2i + f(0) 2 Ex0 h x0 2i + 1 2L + Eb h fb(0) f(0) 2i L2 + f(0) 2 Ex0 h x0 2i + (2L2) 1 2 f(0) 2 + L + 2σ2/|b| Ex0 h x0 2i + 2 f(0) 2 + L + 2σ2 Faster Sampling via Stochastic Gradient Proximal Sampler where the second inequality follows from the choice of η, the third inequality follows from Lemma E.1, and the last inequality establishes since |b| 1. To solve this problem, we start with upper bounding the second moment, i.e., Mk of pk for any k [1, K]. For calculation convenience, we suppose L 1, δ < 1 without loss of generality and set Cm := 4ηδ + 6σ2 η2 + 4 M + 6d η 2 + 6σ2 + (24L2 + 4)M + 12Ld. In this condition, following from Lemma 3.2, we have η2 k Mk + 4ηkδk + 6σ2 η2 k + 4 M + 6d ηk = 24L2Mk + Cm, which implies Mk 24L2 k M + Cm 1 + 24L2 + . . . + 24L2 k 1 24L2 k M + Cm 24L2 1 24L2 K M + 2 + 6σ2 + (24L2 + 4)M + 12Ld . Additionally, Lemma 3.2 also demonstrates that 2 Mk + ηkd 24L2 K M + 2d + 6σ2 + (24L2 + 4)M + 12Ld for all k [0, K 1]. Plugging Eq 44 into Eq 43, we have Ex0,b h W 2 2 (q0, pk+1|k+ 1 2 ,b( |x0, b)) i 24L2 K 1 M + Cm + 2 f(0) 2 + L + 2σ2 24L2 K 1 30L2 M + σ2 + d + 1 + f(0) 2 , which implies Ex0,b h log(2L W 2 2 (q0, pk+1|k+ 1 2 ,b( |x0, b))) i log E h 2L W 2 2 (q0, pk+1|k+ 1 2 ,b( |x0, b) i log h 24L2 K M + σ2 + d + 1 + f(0) 2 i = K log(24L2) + log 30L2 M + σ2 + d + 1 + f(0) 2 α log (1 + L2)d 4α ϵ2 log(24L2) + log 30L2 M + σ2 + d + 1 + f(0) 2 , where the first inequality follows from Jensen s inequality and the last inequality follows from the parameters choice shown in Eq 35. 
By choosing S(x0, b) to its lower bound and taking the expectation for both sides of Eq 42, we have Ex0,b [S(x0, b)] 32σ2 + 96Ld Lα ϵ2 log (1 + L2)d 4α ϵ2 log log (1 + L2)d 4α ϵ2 + 4Lσ2 + 12L2d α ϵ2 log (1 + L2)d + 32σ2 + 96Ld Lα ϵ2 log (1 + L2)d 2L W 2 2 (q1, pk+1|k+ 1 2 ,b( |x0, b)) 32σ2 + 96Ld Lα ϵ2 log (1 + L2)d 4α ϵ2 log log (1 + L2)d 4α ϵ2 + 4Lσ2 + 12L2d α ϵ2 log (1 + L2)d + 32σ2 + 96Ld Lα ϵ2 log (1 + L2)d log 30L2 M + σ2 + d + 1 + f(0) 2 α log (1 + L2)d 4α ϵ2 log(24L2) 4Lσ2 + 12L2d α ϵ2 log (1 + L2)d 4α ϵ2 + 32σ2 + 96Ld Lα ϵ2 log (1 + L2)d α log 30L2 M + σ2 + d + 1 + f(0) 2 α ϵ2 log(24L2) 34L2(σ2 + 3Ld) α2 ϵ2 log(24L2) log (1 + L2)d 4α ϵ2 log 30L2 M + σ2 + d + 1 + f(0) 2 Faster Sampling via Stochastic Gradient Proximal Sampler for all x0 pk+1/2. Hence, the total gradient complexity will be K Ex0,b [S(x0, b)] = O(κ3ϵ 2 max{σ2, Ld}), and the proof is completed. C.2. Warm-started MALA Inner Samplers We define the Re nyi divergence between two distributions as Rr(p q) = 1 r 1 log Z p(x) since it will be widely used in the following section. Then, we provide a detailed theoretical analysis. Lemma C.5. Suppose [A1] holds and Alg 4 is implemented with following hyper-parameters settings: 3/η, τ = Θ δη1/2 , and S = Θ d1/2r1/2 δ log x0 2 + (η fb(0) )2 , the underlying distribution q S of the output particle i.e., z S will satisfy Rr(q S q ) δ2, where Rr denotes Re nyi divergence with order r. Proof. We suppose the Inner ULD is implemented as Alg 4. We denote the underlying distribution of (zs, vs) as q s and its marginal distribution w.r.t. zs is denoted as qs. Since, we only consider Alg 4 rather than its outer loops, the target distribution of Alg 4 can be abbreviated as q (z) exp( g(z)), q (z, v) exp g(z) v 2 , where g(z) := log pk+1|k+ 1 2 ,b(z|x0, b). Combining Lemma B.2 and the choice of the step size, i.e., η 1/2L, we have (2η) 1 I 2g(z) = 2q (z) (3/2η) I. By data-processing inequality, we have Rr(q S q ) Rr(q S q ). By the weak triangle inequality of Re nyi divergence, i.e., Lemma 7 in (Vempala & Wibisono, 2019), we have Rr(q S q ) r 1/2 r 1 R2r(q S q ) + R2r 1( q q ). It can be noted that r 1/2 r 1 will be bounded by 2 when q 3/2 and q denotes the underlying distribution of output particles if we initialize q 0 with q . Then, by combining Lemma E.9, Lemma E.10 and Lemma E.11, we conclude that Rr(q S q ) δ2 if ULD is run with friction parameter γ, step size τ, and iteration complexity N that satisfy: 3/η, τ δη3/4 d1/2r1/2T 1/2 , and S η τ log dη + x0 z 2 rη1/2 By recalling that T = Nτ, solving for these choices of parameters, and omitting logarithmic factors, we conclude that it suffices to run ULD with the following choices of parameters: 3/η, τ = Θ δη1/2 , and S = Θ d1/2r1/2 δ log x0 z 2 (46) Faster Sampling via Stochastic Gradient Proximal Sampler where z is the minimizer of g. Besides, the minimizer of g satisfies g(z ) = fb(z ) + η 1 (z x0) = 0 x0 = η fb(z ) + z , which implies x0 = η fb(z ) + z z η fb(z ) x0 + η fb(z ) z . In this condition, it has z x0 + η fb(z ) fb(0) + fb(0) x0 + Lη z + η fb(0) where the second inequality follows from [A1]. Since, we require Lη 1/2, then the previous inequality is equivalent to z 2 x0 + 2η fb(0) . Plugging this results into Eq 46, the hyper-parameter choice of Alg 4 can be concluded as 3/η, τ = Θ δη1/2 , and S = Θ d1/2r1/2 δ log x0 2 + (η fb(0) )2 . Lemma C.6 (Variant of Theorem 1 of (Wu et al., 2022)). 
Using the notations presented in Alg 3, suppose [A1] holds and Alg 3 is implemented when τ = Θ ηd 1/2 log 2 max d, χ2(q0 q ) , and S = Θ d1/2 log3 χ2(q0 q ) Then, underlying distribution q S of the output particle i.e., z S will satisfy TV (q S, q ) δ. Proof. We suppose the Inner MALA is implemented as Alg 3. We denote the underlying distribution of (zs, vs) as q s and its marginal distribution w.r.t. zs is denoted as qs. Since, we only consider Alg 3 rather than its outer loops, the target distribution of Alg 3 can be abbreviated as q (z) exp( g(z)), q (z, v) exp g(z) v 2 , where g(z) := log pk+1|k+ 1 2 ,b(z|x0, b). Theorem 1 of (Wu et al., 2022) upper bound the total variation distance between the underlying distribution of output particles and the target distribution as TV (q S, q ) Hs + Hs where Hs is defined as Hs := sup {|q0(A) q (A)| : q (A) s} and Φs denotes the s-conductance. The final step size and gradient complexity will depend on the warm-start M defining as Hs Ms. Since, we use χ2 distance to define the warm-start in our analysis. We have additionally the following inequality. |q0(A) q (A)| = s Z 1Adπ Z dq0 dq 1 2 dq p q (A)χ2(q0 q ), which means Hs p sχ2(q0 q ). In this condition, we have TV (q S, q ) p sχ2(q0 q ) + Faster Sampling via Stochastic Gradient Proximal Sampler By requiring 4χ2(q0 p ) and S = 2 Φs log 8χ2(q0 p ) we can achieve TV (q S, p ) ϵ. Besides, we can obtain the M by s = 2χ2(q0 q ) Since the target distribution q is (1/2η)-strongly convex and (3/2η)-smooth when η 1/(2L) due to Lemma B.2, plugging the choice of M shown in Eq 47 into Theorem 1 of (Wu et al., 2022), we know the step size should be τ = Θ ηd 1/2 log 2 max d, χ2(q0 q ) and the gradient complexity will be S = Θ d1/2 log3 χ2(q0 q ) Hence, the proof is completed. Corollary C.7. Suppose [A1] holds, we implement Alg 4 with 3/η, τ = Θ η1/2 , and S = Θ d1/2 log x0 2 + (η fb(0) )2 , and implement Alg 3 with τ = Θ ηd 1/2 log 2 max d, δ 1 , and S = Θ d1/2 log3 (1/δ) . The underlying distribution q S of the output particle of Alg 3 will have KL q S q δ, and the total gradient complexity will be Θ |b|d1/2 log x0 2 + (η fb(0) )2 + log3(1/δ) . Proof. Using the notations in Alg 3, by Lemma C.5, Alg 4 can outputs a distribution q0 satisfying R3(q0 q ) log 2, which implies χ2(q0 q ) exp (R2(q0 q )) 1 exp (R3(q0 q )) 1 1. It should be noted that the second inequality follows from the monotonicity of Re nyi divergence. In this condition, the gradient complexity of Alg 4 should be |b| S = Θ |b|d1/2 log x0 2 + (η fb(0) )2 , where S denotes the iteration number of Alg 4, i.e., Line 2 of Alg 3. With the warm start in χ2 divergence, we invoke Lemma C.6 and achieve TV (q S, q ) δ2/5. with the following gradient complexity |b| S = Θ |b|d1/2 log3 (1/δ) . Faster Sampling via Stochastic Gradient Proximal Sampler Then, we start upper bound the KL divergence between q S and q and have KL q S q χ2(q S q ) = Z q S(z) q (z) 1 2 q (z)dz q (z) 1 q (z)dz Z q S(z) v u u t TV (q S, q ) TV (q S, q ) (exp (2R3(q S q )) + 1) TV (q S, q ) (exp (2R3(q0 q )) + 1) δ, where the second inequality follows from Cauchy Schwarz inequality, the second equation follows from the definition of Re nyi divergence, and the last inequality follows from data-processing inequality. Therefore, to ensure the convergence of KL divergence, i.e., KL q S q δ, the total complexity of this warm start MALA will be Θ |b|d1/2 log x0 2 + (η fb(0) )2 + log3(1/δ) . Hence, the proof is completed. Theorem C.8. Suppose [A1]-[A3] hold. 
With the following parameter settings α log (1 + L2)d 4α ϵ2 , δk = 2ϵ2α L log (1 + L2)d bo = min σ2 4α ϵ2 log (1 + L2)d 4α ϵ2 , n , for Alg 1, if we choose Alg 3 as the inner sampler shown in Line 5 of Alg 1, set 6L, τ = Θ 1 , and S = Θ d1/2 log x0 2 + fb(0) 2 for Alg 4, and d log 2 max d, L 2α ϵ2 log (1 + L2)d and S = Θ d1/2 log3 L 2α ϵ2 log (1 + L2)d for Alg 3, then the underlying distribution of returned particles p K in Alg 1 satisfies TV (p K+1, p ) < 3ϵ. In this condition, the expected gradient complexity will be Θ κ3d1/2σ2ϵ 2 . Proof. We provide this proof with a similar proof roadmap shown in Theorem C.4. Specifically, we show the detailed implementation of Alg 1 with Alg 2 in the following. For all k {0, 1, . . . , K 1}, the mini-batch bk in Alg 1 Line 2 has a uniformed norm which is denoted as |bk| = bo. For all k {0, 1, . . . , K 1}, the conditional probability densities pk+1|k+1/2,b( |xk+1/2bk) in Alg 1 Line 4 formulated as Eq 6 share the same L-2 regularized coefficients, i.e., η 1. For all k {0, 1, . . . , K 1}, the inner sampler shown in Alg 1 Line 5 is chosen as Alg 3. By requiring 2L and K (2α η) 1 log (1 + L2)d α log (1 + L2)d Faster Sampling via Stochastic Gradient Proximal Sampler 4α (1 + α η) K 4α exp( α Kη) ϵ. Besides by requiring bo min Kησ2 2ϵ2 , n = min σ2 4α ϵ2 log (1 + L2)d 4α ϵ2 , n , we have σ p Kη/(2bo) ϵ. Additionally, by requiring, L log (1 + L2)d With these conditions, we have TV (p K, p ) 4α (1 + α η) K 3ϵ which follows from Theorem 3.1. Errors control of inner loops. To determine the hyper-parameter settings of Alg 4 and Alg 3, we can plug the choice of outer loops step size η and inner loops error tolerance δ, i.e., 2L and δ = 2ϵ2α L log (1 + L2)d into Corollary C.7. In this condition, for Alg 4, we set 6L, τ = Θ 1 , and S = Θ d1/2 log x0 2 + fb(0) 2 For Alg 3, we set d log 2 max d, L 2α ϵ2 log (1 + L2)d and S = Θ d1/2 log3 L 2α ϵ2 log (1 + L2)d Then, the underlying distribution q S of the output particle of Alg 3 will satisfy KL q S q 2ϵ2α L log (1 + L2)d and the total gradient complexity will be Θ bod1/2 log x0 2 + (η fb(0) )2 + log3(1/δ) . Since log(1/δ) will only provide additional log terms which will be omitted in Θ, we only consider the following inequality, i.e., Ex0,b h bod1/2 log x0 2 + (η fb(0) )2 i bod1/2 log E h x0 2i + η2E h fb(0) 2i bod1/2 log E h x0 2i + 2η2 f(0) 2 + 2η2E h fb(0) f(0) 2i 4α ϵ2 log (1 + L2)d E h x0 2i + f(0) 2 Faster Sampling via Stochastic Gradient Proximal Sampler the first inequality follows from Jensen s inequality, the second follows from triangle inequality, and the last follows from [A3]. Here, we should note that the underlying distribution of random variable x0 is pk+1/2. Hence, the second moment bound, i.e., Mk+1/2 of pk+1/2 for any k [0, K 1] is required. To solve this problem, we start with upper bounding the second moment, i.e., Mk of pk for any k [1, K]. For calculation convenience, we suppose L 1, δ < 1 without loss of generality and set Cm := 4ηδ + 6σ2 η2 + 4 M + 6d η 2 + 6σ2 + (24L2 + 4)M + 12Ld. In this condition, following from Lemma 3.2, we have η2 k Mk + 4ηkδk + 6σ2 η2 k + 4 M + 6d ηk = 24L2Mk + Cm, which implies Mk 24L2 k M + Cm 1 + 24L2 + . . . + 24L2 k 1 24L2 k M + Cm 24L2 1 24L2 K M + 2 + 6σ2 + (24L2 + 4)M + 12Ld . Additionally, Lemma 3.2 also demonstrates that 2 Mk + ηkd 24L2 K M + 2d + 6σ2 + (24L2 + 4)M + 12Ld for all k [0, K 1]. 
Plugging the following inequality, i.e., 4α ϵ2 log E h x0 2i Ld1/2σ2 4α2 ϵ2 log (1 + L2)d 4α ϵ2 log 24L2 log M + 2d + 6σ2 + (24L2 + 4)M + 12Ld into the RHS of Eq 48 and omitting trivial log terms, we know the gradient complexity for each k will be Θ κ2d1/2σ2ϵ 2 . After multiplying the total iteration number of Alg 1, i.e., K, the final gradient complexity will be Θ κ3d1/2σ2ϵ 2 . Hence, the proof is completed. D. Lemmas for Errors from Initialization of Inner Samplers Proof of Lemma 3.2. We first suppose the second moment of ˆpk is upper bounded and satisfies Eˆpk[ x 2] mk. According to Alg 1 Line 3, we have the closed form of the random variable ˆxk+1/2 is 2 = ˆxk + ηkξ, where ξ N(0, I). Noted that ξ is independent with ˆxk, hence, we have 2 := E ˆxk+ 1 2 = E h ˆxk 2i + ηk d Mk + ηk d. (49) Then, considering the second moment of xk+1, we have E h ˆxk+1 2i = Z ˆpk+1(x) x 2 dx b {1,2,...,n} ˆpk+1|k+ 1 2 ,b(x|y, b) pb(b) b {1,2,...,n} pb(b) Z ˆpk+ 1 2 (y) Z ˆpk+1|k+ 1 2 ,b(x|y, b) x 2 dx dy Faster Sampling via Stochastic Gradient Proximal Sampler Then, we focus on the innermost integration, suppose ˆγy( , ) as the optimal coupling between ˆpk+1|k+ 1 2 ,b( |y) and pk+1|k+ 1 2 ,b( |y). Then, we have Z ˆpk+1|k+ 1 2 ,b(x|y) x 2 dx 2 Z pk+1|k+ 1 2 ,b(x|y) x 2 dx Z ˆγy(ˆx, x) ˆx 2 2 x 2 d(ˆx, x) Z ˆγy(ˆx, x) ˆx x 2 d(ˆx, x) = W 2 2 ˆpk+1|k+ 1 2 ,b, pk+1|k+ 1 2 ,b . (51) Since pk+1|k+ 1 2 ,b is strongly log-concave, i.e., 2 x pk+1|k+ 1 2 ,b(x |x, b) = 2fb(x ) + η 1I L + η 1 k I (2ηk) 1 I, the distribution pk+1|k+ 1 2 ,b also satisfies (2ηk) 1 log-Sobolev inequality due to Lemma E.2. By Talagrand s inequality, we have W 2 2 ˆpk+1|k+ 1 2 ,b, pk+1|k+ 1 2 ,b 4ηk KL ˆpk+1|k+ 1 2 ,b pk+1|k+ 1 2 ,b 4ηkδk. (52) Plugging Eq 51 and Eq 52 into Eq 50, we have E h ˆxk+1 2i X b {1,2,...,n} pb(b) Z ˆpk+ 1 2 (y) 4ηkδk + 2 Z pk+1|k+ 1 2 ,b(x|y) x 2 dx dy . (53) To upper bound the innermost integration, we suppose the optimal coupling between p and pk+1|k+ 1 2 ,b( |y) is γy( , ). Then it has Z pk+1|k+ 1 2 ,b(x|y) x 2 dx 2 Z p (x) x 2 dx Z γy(x , x) x 2 2 x 2 d(x , x) Z γy(x , x) x x 2 d(x , x) = W 2 2 (p , pk+1|k+ 1 Since pk+1|k+ 1 2 ,b satisfies LSI with constant (2ηk) 1. By Talagrand s inequality and LSI, we have W 2 2 (p , pk+1|k+ 1 2 ,b) 4ηk KL p pk+1|k+ 1 log p (x) pk+1|k+ 1 2 ,b(x|y, b) Z p (x) fb(x) f(x) + x y 12η2 k Z p (x) fb(x) f(x) 2 dx + η 2 k Z p (x) x 2 dx + η 2 k y 2 . Combining this inequality with Eq 54, we have Z pk+1|k+ 1 2 ,b(x|y) x 2 dx 12η2 k Z p (x) fb(x) f(x) 2 dx + 12M + 12 y 2 + 2M. Plugging this inequality into Eq 53, we have E h ˆxk+1 2i 4ηkδk + X b {1,2,...,n} 24η2 k pb(b) Z ˆpk+ 1 2 (y) Z p (x) fb(x) f(x) 2 dx dy b {1,2,...,n} 24 pb(b) Z ˆpk+ 1 2 (y) y 2 dy. (55) According to [A3], suppose we sample b uniformly from {1, 2, . . . , n}, then for any x Rd we have i=1 ( f(x) fbi(x)) j=1 E ( fbi(x) f(x)) ( fbj(x) f(x)) i=1 E h fbi(x) f(x) 2i = σ2 Faster Sampling via Stochastic Gradient Proximal Sampler Plugging this equation into the second term of RHS of Eq 53, we have b {1,2,...,n} pb(b) Z ˆpk+ 1 2 (y) Z p (x) fb(x) f(x) 2 dx dy 2 (y) Z p (x)Eb h fb(x) f(x) 2i dxdy = σ2 Besides, for the last term of RHS of Eq 53, we have b {1,2,...,n} pb(b) Z ˆpk+ 1 2 (y) y 2 dy = Mk+ 1 With these conditions, Eq 55 can be reformulated as Mk+1 := E h ˆxk+1 2i 4ηkδk + 24η2 kσ2 |b| + 28M + 24Mk+ 1 24 Mk + 4ηkδk + 24η2 kσ2 |b| + 28M + 24ηkd. (56) where the last inequality follows from Eq 49. Hence, the proof is completed. Remark D.1. 
According to Lemma 3.2, when L 1/5, We plug the following hyper-parameters settings, i.e., 2 , and |bk| 6σ2 into Eq 56, then we have Mk+1 Mk + 5(d + M) MK M + K 5(d + M) 6K(d + M), which is the second moment bound along the update of Alg 1. E. Auxiliary Lemmas Lemma E.1. Suppose a function f can be decomposed as a finite sum, i.e., f(x) = 1/n Pn i=1 fi(x) where [A3] is satisfied. If we uniformly sample a minibatch b from {1, 2, . . . , n} which constructs a minibatch loss shown in Eq 3, then for any x Rd, we have Eb h fb(x) f(x) 2i σ2 Proof. For minibatch variance, we have i b ( f(x) fi(x)) j b ( fi(x) f(x)) ( fj(x) f(x)) i b fi(x) f(x) 2 # Hence, the proof is completed. Lemma E.2 (Variant of Lemma 10 in (Cheng & Bartlett, 2018)). Suppose log p is m-strongly convex function, for any distribution with density function p, we have KL p p 1 2m Z p(x) log p(x) Faster Sampling via Stochastic Gradient Proximal Sampler By choosing p(x) = g2(x)p (x)/Ep g2(x) for the test function g: Rd R and Ep g2(x) < , we have Ep g2 log g2 Ep g2 log Ep g2 2 m Ep h g 2i , which implies p satisfies m-log-Sobolev inequality. Lemma E.3 (Theorem 3 in (Chen et al., 2022)). Assume that p exp( f ) satisfies [A2]. For any η > 0, and any initial distribution p1 the k-th iterate pk of the proximal sampler with step size ηk satisfies KL pk+1 p KL pk p (1 + α ηk) 2 , which means it has KL pk+1 p KL p0 p i=1 (1 + α ηk) 2 . Lemma E.4. Suppose p exp( f ) defined on Rd satisfies α -log-Sobolev inequality where f satisfies [A1], p0 is the standard Gaussian distribution defined on Rd, then we have KL p0 p (1 + L2)d Proof. According to the definition of LSI, we have Z p1(x) log p1(x) 2 dx = 1 2α Z p1(x) x + f (x) 2 dx Z p1(x) x 2 + L2 x 2 dx = (1 + L2)d where the second inequality follows from the L-smoothness of f and the last equation establishes since Ep0[ x 2] = d is for the standard Gaussian distribution p0 in Rd. Lemma E.5 (Convexity of KL divergence). Suppose {qi}i {1,2,...,n} and p are probability densities defined on Rd and {wi}i {1,2,...,n} are real numbers satisfying i {1, 2, . . . , n} wi [0, 1] and i=1 wi = 1. i=1 wi KL qi p . Proof. We first consider the case when n = 2, which means it is only required to prove KL λq1 + (1 λ)q2 p λKL q1 p + (1 λ)KL q2 p (57) for any λ [0, 1]. In this condition, we have KL λq1 + (1 λ)q2 p = Z (λq1(x) + (1 λ)q2(x)) log(λq1(x) + (1 λ)q2(x))dx Z (λq1(x) + (1 λ)q2(x)) log p(x)dx. (58) Since φ(u) := u log u satisfies convexity, i.e., 2φ(u) = u 1 > 0 u > 0, Faster Sampling via Stochastic Gradient Proximal Sampler which implies λq1(x) + (1 λ)q2(x) log (λq1(x) + (1 λ)q2(x)) λq1(x) log q1(x) + (1 λ)q2(x) log q2(x), then RHS of Eq 58 satisfies RHS Z λq1(x) log q1(x)dx Z λq1(x) log p(x)dx + Z (1 λ)q2(x) log q2(x)dx Z (1 λ)q2(x) log p(x)dx = λKL q1 p + (1 λ)KL q2 p . Then, for n > 2 case, we suppose i=1 wi KL qi p . (59) Then, by setting q := Pn 1 i=1 wiqi 1 wn = Pn 1 i=1 wiqi Pn 1 i=1 wi , then we have =KL (1 wn)q + wnqn p (1 wn)KL q p + wn KL qn p wi 1 wn KL qi p + wn KL qn p = i=1 wi KL qi p , where the first inequality follows from Eq 57 and the last inequality follows from Eq 59. Hence, the proof is completed. Lemma E.6 (Lemma 11 in (Vempala & Wibisono, 2019)). Suppose the density function satisfies p exp( f) where f is L-smooth, i.e., [A1]. Then, it has Ex p h f(x) 2i Ld. Lemma E.7 (Lemma 5 in (Durmus et al., 2019)). Suppose the underlying distributions of random variables x and x+ 2τξ are p and p respectively, where ξ N(0, I). 
If p, p P2(Rd) and Ep [log p ] < , then it has 2τ (Ex p [log p (x)] Ex p [log p (x)]) W 2 2 (p, p ) W 2 2 (p , p ). Definition E.8 (Definition of Orlicz Wasserstein metric). The Orlicz Wasserstein metric between distributions p and q is Wψ(p, q) := inf (x,y) Γ(p,q) x y ψ x ψ := inf λ > 0 : E ψ x Lemma E.9 (Theorem 4.4 in (Altschuler & Chewi, 2023)). Suppose q exp( g) where g is µ-strongly-convex and L-smooth. Let P denote the Markov transition kernel for underdamped Langevin dynamics (ULD) when run with friction paramter γ = 2L and step size τ 1/(κ L). Then, for any target accuracy 0 < ϵ p log 2/(i 1), any Re nyi divergence order i 1 and any two initial distributions q 0, q 1 P(R2d), Ri(PNq 0 PNq 1) ϵ2, if the number of ULD iteration is L µτ log 2Wψ(q0, q ) where q0 is the marginal distribution of q 0 w.r.t. the first d dimensions and Wψ is defined as Definition E.8. Faster Sampling via Stochastic Gradient Proximal Sampler Lemma E.10 (Remark 4.2 in (Altschuler & Chewi, 2023)). Suppose q exp( g) where g is µ-strongly-convex and L-smooth. We run underdamped Langevin dynamics (ULD) when with friction paramter γ = 2L, step size τ 1/(κ L) and initialize the distribution with q 0 = δx N(0, I), then it has Wψ(q0, q ) p where x denotes the minimizer of g. Lemma E.11 (Lemma 4.8 in (Altschuler & Chewi, 2023)). Suppose q (z) exp( g(z)) where g is µ-strongly-convex and L-smooth. Let q (z, v) exp( g(z) v 2/2). Let P denote the Markov transition kernel for underdamped Langevin dynamics (ULD) when run with friction paramter γ L and step size τ L 3/4d 1/2i 1(T log N) 1/2, where N is the total number of iterations and T = Nτ is the total elapsed time. Then, Ri(PNq q ) L3/2dτ 2i T. F. Additional Experiments Due to space limitations, we defer some experimental details in Section 5 to this part. In our experiments, we fix the number of stochastic gradient usage at 12000. As the primary goal of our experiments is to verify our theory, we set the inner batch size, i.e., bs = 1. Additionally, to be more comparable with SGLD, we set S = S 1. Under these conditions, we primarily focus on tuning three other hyper-parameters. Among them, the inner step size τ is chosen from the set {0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4}, which somewhat corresponds to the step size in SGLD. The inner iteration S is chosen from {20, 40, 80}, which also determines K = 12000/S. The outer step size η is a special hyper-parameter in SPS-SGLD, which corresponds to the coefficient of quadratic regularizer in RGO. As our theory requires it to be larger than τ in our theory, we choose it from {1.0, 4.0, 10.0} in our experiments. The optimal hyper-parameters obtained through grid search are presented in Table 2. Hyper-Params Dimensions d = 10 d = 20 d = 30 d = 40 d = 50 Inner step size τ 0.4 0.4 0.4 0.4 0.4 Inner iteration number S 40 20 20 80 80 Outer step size η 4.0 4.0 10.0 10.0 10.0 Table 2. Hyper-parameter settings for different dimension tasks based on the grid search. For the choice of these hyper-parameters, the inner step size somewhat corresponds to the step size in SGLD and can be set in the same order of magnitude. The outer step size η is a special hyper-parameter in SPS-SGLD, it requires to be larger than τ in our theory and experiments. Furthermore, our theory indicates that the inner iteration number, i.e., S, is in the same order of magnitude as η/τ. This principle of the hyper-parameter choice can be roughly verified by the optimal hyper-parameter settings shown in Table 2. 
Moreover, we conduct a grid search for bs under our experimental settings. It is Inner batch size Dimensions d = 10 d = 20 d = 30 d = 40 d = 50 bs = 1 0.105 0.063 0.064 0.060 0.055 bs = 5 0.143 0.078 0.081 0.074 0.082 bs = 10 0.138 0.092 0.086 0.122 0.110 bs = 20 0.175 0.107 0.090 0.142 0.117 Table 3. The marginal accuracy results under different bs settings. worth noting that since we fix the gradient usage, increasing the inner batch size will cause the iteration number to decrease sharply. Consequently, the overall performance in our experiments is worse than that observed with the bs = 1 setting. Faster Sampling via Stochastic Gradient Proximal Sampler Although we only provide gradient complexity in our theory, both SGLD and SPS-SGLD are first-order samplers, with the primary computational cost stemming from the number of gradient calculations referred to as gradient complexity in our paper. Consequently, we can assert that SGLD and SPS-SGLD have nearly the same computational cost when the number of gradient calls is fixed, which is set at 12k in our experiments. To substantiate this claim, we present the wall clock time under 12k gradient calls (normalizing SPS-SGLD wall clock time to 1) in the table below. Algorithms Dimensions d = 10 d = 20 d = 30 d = 40 d = 50 SPS-SGLD 1 1 1 1 1 SGLD 0.971 0.968 0.981 0.970 0.969 Table 4. The wall clock time comparison between SPS-SGLD and SGLD. Moreover, we add some other baselines, e.g., such as AB-SGLD and CC-SGLD proposed by Das et al. (2023). We selected these variants because they achieved the best theoretical results, apart from our own. With target distributions set as shown in Section 5, the total variation distance performance for different algorithms is presented below. The results Algorithms Dimensions d = 10 d = 20 d = 30 d = 40 d = 50 SPS-SGLD 0.105 0.063 0.064 0.060 0.055 CC-SGLD 0.143 0.125 0.105 0.121 0.114 AB-SGLD 0.154 0.129 0.121 0.120 0.119 vanila-SGLD 0.176 0.144 0.122 0.131 0.134 Table 5. The marginal accuracy results comparison among SPS-SGLD and other SGLD variants. demonstrate that SPS-SGLD significantly outperforms CC-SGLD and AB-SGLD. Furthermore, such SGLD variants can also be incorporated as inner samplers within our framework, potentially enhancing the performance of SPS-type methods even further. Additionally, we would be happy to modify the name to distinguish it from SGLD variants, such as CC-SGLD and AB-SGLD.