# A Proximal Algorithm for Sampling

Published in Transactions on Machine Learning Research (06/2023)

Jiaming Liang (jiaming.liang@yale.edu), Department of Computer Science, Yale University, New Haven, CT 06511.
Yongxin Chen (yongchen@gatech.edu), School of Aerospace Engineering, Georgia Institute of Technology, Atlanta, GA 30332.

Reviewed on OpenReview: https://openreview.net/forum?id=CkXOwlhf27

Abstract: We study sampling problems associated with potentials that may lack smoothness or convexity. Departing from the standard smooth setting, the potentials are only assumed to be weakly smooth or non-smooth, or the summation of multiple such functions. We develop a sampling algorithm that resembles proximal algorithms in optimization for this challenging sampling task. Our algorithm is based on a special case of Gibbs sampling known as the alternating sampling framework (ASF). The key contribution of this work is a practical realization of the ASF based on rejection sampling for both non-convex and convex potentials that are not necessarily smooth. In almost all the cases of sampling considered in this work, our proximal sampling algorithm achieves a better complexity than all existing methods.

1 Introduction

The problem of drawing samples from an unnormalized probability distribution plays an essential role in data science and scientific computing (Durmus et al., 2018; Clarage et al., 1995; Maximova et al., 2016). It has been widely used in many areas such as Bayesian inference, Bayesian neural networks, probabilistic graphical models, biology, and machine learning (Gelman et al., 2013; Kononenko, 1989; Koller & Friedman, 2009; Krauth, 2006; Sites Jr & Marshall, 2003; Durmus et al., 2018). Compared with optimization-oriented methods, sampling has the advantage of quantifying the uncertainty and confidence level of the solution, and it often provides more reliable solutions to engineering problems. This advantage, however, comes at a higher computational cost. It is thus important to develop more efficient sampling algorithms.

In the classical setting of sampling, the potential function f of an unnormalized target distribution exp(−f(x)) is assumed to be smooth and (strongly) convex. Over the past decades, many sampling algorithms have been developed, including Langevin Monte Carlo (LMC), kinetic Langevin Monte Carlo (KLMC), Hamiltonian Monte Carlo (HMC), the Metropolis-adjusted Langevin algorithm (MALA), etc. (Dalalyan, 2017; Grenander & Miller, 1994; Parisi, 1981; Roberts & Tweedie, 1996; Dalalyan & Riou-Durand, 2020; Bou-Rabee & Hairer, 2013; Roberts & Stramer, 2002; Neal, 2011). Many of these algorithms are based on some type of discretization of the Langevin diffusion or the underdamped Langevin diffusion, and they resemble gradient-based algorithms in optimization. These algorithms work well for (strongly) convex and smooth potentials, and many non-asymptotic complexity bounds have been proven. However, the cases where either the convexity or the smoothness is lacking are much less understood (Chewi et al., 2021; Chen et al., 2022; Chatterji et al., 2020; Erdogdu & Hosseinzadeh, 2021; Liang & Chen, 2022; Dalalyan et al., 2022; Balasubramanian et al., 2022; Mou et al., 2022a). In this paper, we consider the challenging task of sampling from potentials that are not smooth and even not convex. Many sampling problems in real applications fall into this setting.
For instance, Bayesian neural networks are highly non-convex models corresponding to probability densities with multi-modality (Izmailov et al., 2021), and the lack of smoothness is due to the use of activation functions such as ReLU. The goal of this work is to develop an efficient algorithm with provable guarantees for a class of potentials that are not smooth. In particular, we consider potential functions that are semi-smooth, defined by (1).

Inspired by the recent line of research that lies at the interface of sampling and optimization, we examine this sampling task from an optimization perspective. We build on the intuition of proximal algorithms for non-smooth optimization problems and develop a proximal algorithm to sample from non-convex and semi-smooth potentials. Our algorithm is based on the alternating sampling framework (ASF) (Lee et al., 2021) developed recently to sample from strongly convex potentials. In a nutshell, the ASF is a Gibbs sampler over a carefully designed augmented distribution of the target, and one can thus sample from the target distribution by sampling from the augmented distribution. The convergence results of the ASF have recently been improved (Chen et al., 2022) to cover non-log-concave distributions that satisfy functional inequalities such as the logarithmic Sobolev inequality (LSI) and the Poincaré inequality (PI).

The ASF is an idealized algorithm that is not directly implementable. In each iteration, it needs to query the so-called restricted Gaussian oracle (RGO), which is itself a sampling task from a quadratically regularized distribution exp(−f(x) − ‖x − y‖²/(2η)) for some given η > 0 and y ∈ R^d. The RGO can be viewed as a sampling counterpart of the proximal map in optimization. The total complexity of the ASF thus depends on that of the RGO. Except for a few special cases where f has certain structures, the RGO is usually a challenging task. One key contribution of this work is a practical and efficient algorithm for the RGO for potentials that are neither smooth nor convex. This algorithm extends the recent work of Liang & Chen (2022) for convex and non-smooth potentials. Combining the ASF and our algorithm for the RGO, we establish a proximal algorithm for sampling from (convex or non-convex) semi-smooth potentials. Note that the techniques used in Liang & Chen (2022) are no longer applicable to these non-convex semi-smooth settings; in this work, we develop a new algorithm and the associated proof techniques for an efficient RGO implementation.

Our contributions are summarized as follows. i) We develop an efficient sampling scheme for the RGO of semi-smooth potentials, which can be either convex or non-convex, and bound its complexity with a novel technique. ii) We combine our RGO scheme and the ASF to develop a sampling algorithm that can sample from both convex and non-convex semi-smooth potentials; our algorithm has a better non-asymptotic complexity than almost all existing methods in the same setting. iii) We further extend our algorithms to potentials that are the summation of multiple semi-smooth functions.

Table 1: Complexity bounds for sampling from non-convex potentials.

| Source | Complexity | Assumption | Metric |
| --- | --- | --- | --- |
| Chewi et al. (2021) | $\mathcal{O}\big(C_{\mathrm{PI}}^{1+1/\alpha} L_{\alpha}^{2/\alpha} d^{2+1/\alpha}\big)$ | weakly smooth, PI, $\alpha > 0$ | Rényi |
| this paper (Thm. 3.5) | $\mathcal{O}\big(C_{\mathrm{PI}} L_{\alpha}^{2/(1+\alpha)} d^{2}\big)$ | semi-smooth, PI | Rényi |
| Nguyen et al. (2021) | $\mathcal{O}\big(C_{\mathrm{LSI}}^{1+\max_i\{1/\alpha_i\}}\, n \max_i\{L_{\alpha_i}^{2}\}\, d/\varepsilon\big)$ | $\alpha_i > 0$, composite semi-smooth, LSI | KL |
| this paper (Thm. 4.4) | $\mathcal{O}\big(C_{\mathrm{LSI}} \sum_{i=1}^{n} L_{\alpha_i}^{2/(\alpha_i+1)} d\big)$ | composite semi-smooth, LSI | KL |
| this paper (Thm. 4.5) | $\mathcal{O}\big(C_{\mathrm{PI}} \sum_{i=1}^{n} L_{\alpha_i}^{2/(\alpha_i+1)} d\big)$ | composite semi-smooth, PI | Rényi |
Related works: MCMC sampling from non-convex potentials has been investigated in Raginsky et al. (2017); Vempala & Wibisono (2019); Wibisono (2019); Chewi et al. (2021); Erdogdu & Hosseinzadeh (2021); Mou et al. (2022b); Luu et al. (2021). There have also been some works on sampling without smoothness: Lehec (2021); Durmus et al. (2019); Chatterji et al. (2020); Chewi et al. (2021); Mou et al. (2022a); Shen et al. (2020); Durmus et al. (2018); Freund et al. (2022); Salim & Richtárik (2020); Bernton (2018); Nguyen et al. (2021); Liang & Chen (2022). The literature for the case where the potential function lacks both convexity and smoothness is rather scarce. In Nguyen et al. (2021); Erdogdu & Hosseinzadeh (2021), the authors analyze the convergence of LMC for weakly smooth potentials that satisfy a dissipativity condition; the target distribution is assumed to satisfy some functional inequality, and Erdogdu & Hosseinzadeh (2021) also assumes an additional tail growth condition. The dissipativity condition is removed in Chewi et al. (2021). The results in Nguyen et al. (2021) are applicable to potentials that are the summation of multiple weakly smooth functions, while those in Chewi et al. (2021); Erdogdu & Hosseinzadeh (2021) are not. To compare our results with them, we make the simplification that the initial distance, either in KL or Rényi divergence, to the target distribution is O(d). The results in cases with non-convex potentials are presented in Table 1. It can be seen that our complexities have better dependence on all the parameters: the LSI constant C_LSI, the PI constant C_PI, the weak-smoothness coefficients L_α, and the dimension d. Moreover, our complexity bounds depend polylogarithmically on the accuracy ε (thus ε does not appear in the O notation), while all the other results have polynomial dependence on 1/ε.

2 Problem formulation and Background

We are interested in sampling from distributions with potentials that are not necessarily smooth. More specifically, we consider the sampling task with the target distribution ν ∝ exp(−f(x)), where the potential f satisfies

$$\|f'(u) - f'(v)\| \;\le\; \sum_{i=1}^{n} L_{\alpha_i} \|u - v\|^{\alpha_i}, \qquad \forall\, u, v \in \mathbb{R}^d \qquad (1)$$

for some α_i ∈ [0, 1] and L_{α_i} > 0, 1 ≤ i ≤ n. Here f′ denotes a subgradient of f in the Fréchet subdifferential (see Definition A.1). When n = 1 and α_1 = 1, it is well known that f is a smooth function. When n = 1 and 0 < α_1 < 1, f satisfying (1) is known as a weakly smooth function. When n = 1 and α_1 = 0, f satisfying (1) is a non-smooth function. Thus, the setting we consider covers all three cases: smooth, weakly smooth, and non-smooth. For ease of reference, we term a function satisfying condition (1) a semi-smooth function. The target distribution ν is assumed to satisfy LSI or PI, but f can be non-convex. The class of distributions we consider has been studied in (Chatterji et al., 2020; Chewi et al., 2021; Nguyen et al., 2021) for MCMC sampling. One example of such a distribution is ν ∝ exp(−‖A_2σ(A_1x) − b‖² − ‖x‖² − ‖x‖) for some full rank matrices A_1, A_2, a vector b, and some activation (e.g., ReLU) function σ; this is a Bayesian regression problem. In this work, we aim to develop a proximal algorithm for sampling from non-convex potentials that satisfy (1). Our method is built on the alternating sampling framework (ASF) introduced in Lee et al. (2021).
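To make condition (1) concrete, here is a minimal numerical check (our own illustration, not from the paper) for the one-dimensional potential f(x) = |x| + x²/2, which satisfies (1) with n = 2, (α_1, L_{α_1}) = (0, 2) and (α_2, L_{α_2}) = (1, 1):

```python
import numpy as np

def subgrad(x):
    # A subgradient of f(x) = |x| + x**2/2; sign(0) = 0 is a valid choice at the kink.
    return np.sign(x) + x

rng = np.random.default_rng(0)
u, v = rng.normal(size=1000), rng.normal(size=1000)

lhs = np.abs(subgrad(u) - subgrad(v))
# Right-hand side of (1): L_{a1}|u-v|^{a1} + L_{a2}|u-v|^{a2} with (a1, L_{a1}) = (0, 2), (a2, L_{a2}) = (1, 1).
rhs = 2.0 * np.abs(u - v) ** 0 + 1.0 * np.abs(u - v) ** 1
print(bool(np.all(lhs <= rhs + 1e-12)))  # expected: True
```

The non-smooth part contributes the α = 0 term and the quadratic part the α = 1 term, which is exactly the composite structure treated in Section 4.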
Unlike most existing sampling algorithms that require the potential to be smooth, ASF is applicable to semi-smooth problems. Initialized at x0 ρX 0 , ASF with target distribution πX(x) exp( f(x)) and stepsize η > 0 performs the two alternating steps as follows. Algorithm 1 Alternating Sampling Framework (Lee et al., 2021) 1. Sample yk πY |X(y | xk) exp[ 1 2η xk y 2] 2. Sample xk+1 πX|Y (x | yk) exp[ f(x) 1 2η x yk 2] The ASF is a special instance of the Gibbs sampling (Geman & Geman, 1984) from π(x, y) exp f(x) 1 2η x y 2 . (2) In Algorithm 1, sampling yk given xk in step 1 is easy since πY |X(y | xk) = N(xk, ηI) is an isotropic Gaussian distribution. Sampling xk+1 given yk in step 2 is however a highly nontrivial task; it corresponds to the restricted Gaussian oracle for f (Lee et al., 2021), defined as follows. Definition 2.1 Given a point y Rd and stepsize η > 0, the restricted Gaussian oracle (RGO) for f : Rd R is a sampling oracle that returns a random sample from a distribution proportional to exp( f( ) y 2/(2η)). The RGO can be viewed as the sampling counterpart of the proximal map in optimization that is widely used in proximal algorithms for optimization (Parikh & Boyd, 2014). The ASF is an idealized algorithm; Published in Transactions on Machine Learning Research (06/2023) an efficient implementation of the RGO is crucial to use this framework in practice. For some special cases of f, the RGO admits a computationally efficient realization (Mou et al., 2022a; Shen et al., 2020; Liang & Chen, 2022). For general f, especially semi-smooth ones considered in this work, it was not clear how to realize the RGO efficiently. Under the assumption that the RGO in the ASF can be efficiently realized, the ASF exhibits surprising convergence properties. It was firstly established in Lee et al. (2021) that, when f is strongly convex, Algorithm 1 converges linearly. This result is further improved recently in Chen et al. (2022) under various weaker assumptions on the target distribution πX exp( f). We summarize below several convergence results established in Chen et al. (2022) that will be used. Throughout the paper, we abuse notation by identifying a probability measure with its density w.r.t. Lebesgue measure. To this end, for two probability distributions ρ ν, we denote by Hν(ρ) := Z ρ log ρ ν , χ2 ν(ρ) := Z ρ2 ν 1, Rq,ν(ρ) := 1 q 1 log Z ρq the KL divergence, the Chi-squared divergence, and the Rényi divergence, respectively. Note that R2,ν = log(1+χ2 ν), R1,ν = Hν, and Rq,ν Rq ,ν for any 1 q q < (Rényi, 1961; Vempala & Wibisono, 2019). We denote by W2 the Wasserstein-2 distance (Villani, 2021). Recall a probability distribution ν satisfies LSI with constant CLSI > 0 (1/CLSI-LSI) if for every ρ, Hν(ρ) CLSI 2 Jν(ρ), where the Fisher information Jν(ρ) is defined as Jν(ρ) = Eρ[ log ρ ν 2]. A probability distribution ν satisfies PI with constant CPI > 0 (1/CPI-PI) if for any smooth bounded function ψ : Rd R, we have Eν[(ψ Eν(ψ))2] CPIEν[ ψ 2]. Theorem 2.2 ((Chen et al., 2022, Theorem 3)) Assume that πX exp( f) satisfies λ-LSI. For any initial distribution ρX 0 , the k-th iterate ρX k of Algorithm 1 with step size η > 0 satisfies HπX(ρX k ) HπX(ρX 0 ) (1 + λη)2k . Moreover, for all q 1, Rq,πX(ρX k ) Rq,πX(ρX 0 ) (1 + λη)2k/q . Theorem 2.3 ((Chen et al., 2022, Theorem 4)) Assume πX exp( f) satisfies λ-PI. For any initial distribution ρX 0 , the k-th iterate ρX k of Algorithm 1 with step size η > 0 satisfies χ2 πX(ρX k ) χ2 πX(ρX 0 ) (1 + λη)2k . 
Moreover, for all q 2, Rq,πX(ρX k ) ( Rq,πX(ρX 0 ) 2k log(1+λη) q , if k q 2 log(1+λη) (Rq,πX(ρX 0 ) 1) , 1/(1 + λη)2(k k0)/q , if k k0 := q 2 log(1+λη) (Rq,πX(ρX 0 ) 1) . Theorem 2.4 ((Chen et al., 2022, Theorem 2)) Assume that πX exp( f) is log-concave. For any initial distribution ρX 0 , the k-th iterate ρX k of Algorithm 1 with step size η > 0 satisfies HπX(ρX k ) W 2 2 (ρX 0 , πX) kη . To use the ASF for sampling problems, we need to realize the RGO with efficient implementations. In the rest of this paper, we develop an efficient algorithm for RGO associated with a potential satisfying (1), and then combine it with the ASF to establish a proximal algorithm for sampling. The complexity of the proximal algorithm can be obtained by combining the above convergence results for ASF and the complexity results we establish for RGO. The rest of the paper is organized as follows. In Section 3 we consider a special Published in Transactions on Machine Learning Research (06/2023) case of (1) with n = 1, develop an efficient implementation for RGO via rejection sampling, and establish complexity results for sampling from distributions with non-convex and semi-smooth potentials. In Section 4, we extend the aforementioned complexity results to the general cases (1). In Section 5, we specialize (1) to the case where f is convex. In Section 6, we present preliminary computational results to demonstrate the efficacy of the proximal sampling algorithm. In Section 7, we present some concluding remarks. Finally, in Appendices A-E, we present technical results and proofs omitted in the paper and provide a self-contained discussion on solving a subproblem in the proximal sampling algorithm. 3 Proximal sampling for non-convex and semi-smooth potentials Our main objective in this section is to establish complexity results for sampling from distributions with non-convex and semi-smooth potentials satisfying (1) with n = 1, i.e., f (u) f (v) Lα u v α, u, v Rd. (3) We refer to such a function an Lα-α-semi-smooth function. The bottleneck of Algorithm 1 for sampling from a general distribution exp( f) is an efficient realization of the RGO, i.e., step 2 of Algorithm 1. To address this issue, we develop an efficient algorithm for the corresponding RGO based on rejection sampling. We show that, with a carefully designed proposal and a sufficiently small η, the expected number of rejection sampling steps to obtain one effective sample in RGO turns out to be bounded above by a dimension-free constant. The core to achieving such a constant bound on the expected rejection steps is a novel construction of proposal distribution that does not rely on the convexity of f and a refined analysis that captures the nature of semi-smooth functions. We utilize a useful property of semi-smooth functions that they can be approximated by smooth functions to arbitrary accuracy, at the cost of increasing their smoothness parameters. This is formalized in the following lemma, of which the proof is postponed to Appendix B. Relevant ideas have been explored in Devolder et al. (2014); Nesterov (2015) to design universal methods for convex semi-smooth optimization problems. Lemma 3.1 Assume f is an Lα-α-semi-smooth function, then for δ > 0 and every u, v Rd |f(u) f(v) f (v), u v | M 2 u v 2 + (1 α)δ 1 α α+1 . (5) Our algorithm is inspired by Liang & Chen (2022), which also uses rejection sampling for RGO. 
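Before describing the proposal construction, here is a quick sanity check of Lemma 3.1 (ours, not from the paper; it assumes the reading M = L_α^{2/(α+1)} / [(α+1)δ]^{(1−α)/(α+1)} of the smoothing constant in (5), consistent with the proof in Appendix B.1). The sketch verifies the bound numerically for f(x) = |x|, for which α = 0 and L_0 = 2:

```python
import numpy as np

L0, alpha, delta = 2.0, 0.0, 0.3
# Assumed form of the smoothing constant in (5):
# M = L_alpha^{2/(alpha+1)} / ((alpha+1)*delta)^{(1-alpha)/(alpha+1)}.
M = L0 ** (2 / (alpha + 1)) / ((alpha + 1) * delta) ** ((1 - alpha) / (alpha + 1))

f, fprime = np.abs, np.sign  # f(x) = |x| and a valid subgradient

rng = np.random.default_rng(1)
u, v = rng.normal(size=10_000), rng.normal(size=10_000)

lhs = np.abs(f(u) - f(v) - fprime(v) * (u - v))
rhs = 0.5 * M * (u - v) ** 2 + 0.5 * (1 - alpha) * delta
print(bool(np.all(lhs <= rhs + 1e-12)))  # expected: True
```

Larger δ makes M smaller (a cruder but flatter quadratic majorant), which is the trade-off exploited when choosing the stepsize η below.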
The proposal of rejection sampling used in Liang & Chen (2022) is a Gaussian distribution centered at an (approximate) minimizer of the regularized potential function. We thus consider the regularized optimization problem proxηf(y) := argmin x Rd f η y (x) := f(x) + 1 2η x y 2 , (6) where y Rd is given. Departing from the convex setting studied in Liang & Chen (2022), when f is nonconvex and semi-smooth, (6) may not be a convex optimization regardless of the value of η. Nevertheless, thanks to Lemma 3.1, f η y is close to a strongly convex and smooth function up to some approximation error, and (6) can still be solved efficiently using convex smooth optimization algorithms such as Nesterov s acceleration. We describe a variant of the method in Algorithm 3 in Appendix D. The following proposition presents a complexity result of Algorithm 3 for finding an approximate stationary point of f η y with a small η. Its proof is postponed to Appendix B. Proposition 3.2 Assume η 1 Md, and let w Rd be an approximate stationary point of f η y , i.e., Md, s = f (w) + 1 η (w y), (7) where M is as in (5). Then, the iteration-complexity to find w with Algorithm 3 is O(1). Published in Transactions on Machine Learning Research (06/2023) With this approximate stationary point, we obtain the following key ingredients of our rejection samplingbased RGO. Lemma 3.3 Let w Rd be a stationary point of f η y , i.e., η (w y) = 0, (8) and w be an approximate stationary point as in (7). Define hw 1,y(x) := f(w) + f (w), x w M 2 x w 2 + 1 2η x y 2 (1 α)δ hw 2,y(x) := f(w ) + f (w ), x w + M 2 x w 2 + 1 2η x y 2 + (1 α)δ Then, we have for every x Rd, hw 1,y(x) f η y (x) hw 2,y(x). (11) Proof: Inequalities in (11) directly follow from (4) and the definitions of hw 1,y and hw 2,y in (9) and (10), respectively. We are now ready to present the rejection sampling algorithm (Algorithm 2) for RGO. Algorithm 2 RGO Rejection Sampling 1. Compute an approximate solution w satisfying (7) with Algorithm 3 2. Generate sample X exp( hw 1,y(x)) 3. Generate sample U U[0, 1] 4. If U exp( f η y (X)) exp( hw 1,y(X)), then accept/return X; otherwise, reject X and go to step 2. By construction, the proposal exp( hw 1,y(x)) is close to the target πX|Y (x | y) exp( f η y (x)) when η is sufficiently small, and hence the expected number of rejection steps is small. The following proposition rigorously justifies this intuition. Proposition 3.4 Assume f is Lα-α-semi-smooth and let f η y be as in (6). Then X generated by Algorithm 2 follows the distribution πX|Y (x | y) exp f η y (x) . Moreover, if η 1 Md = [(α + 1)δ] 2 α+1 α d , (12) then the expected number of rejection steps in Algorithm 2 is at most exp 3(1 α)δ Proof: It is well-known in rejection sampling X πX|Y (x|y). By definition, the probability that X is accepted is U exp( f η y (X)) exp( hw 1,y(X)) = Z exp( f η y (x)) exp( hw 1,y(x)) exp( hw 1,y(x)) R exp( hw 1,y(z))dz dx = R exp( f η y (x))dx R exp( hw 1,y(x))dx. Published in Transactions on Machine Learning Research (06/2023) Using the above identity, (11), and Lemma A.4, we have U exp( f η y (X)) exp( hw 1,y(X)) R exp( hw 2,y(x))dx R exp( hw 1,y(x))dx d/2 exp 1 2η w 2 f(w ) + f (w ), w 1 2η y 2 1 α exp 1 2η w 2 + η 2(1 ηM) s 2 f(w) + 1 η w, y w 1 2η y 2 + 1 α d/2 exp (1 α)δ η 2(1 ηM) s 2 2η w 2 f(w ) + f(w) + f (w ), w 1 η w, y w . 
(13) It follows from (4) with (u, v) = (w, w ) that f(w) + f(w ) + f (w ), w w M 2 w w 2 + (1 α)δ which together with (8) implies that 2η w 2 f(w ) + f(w) + f (w ), w 1 2η w 2 f(w ) + f(w) + f (w ), w 1 η w, ηf (w ) + w w 2η w w 2 f(w ) + f(w) + f (w ), w w 2 w w 2 (1 α)δ where the last inequality is due to (12). Plugging the above inequality into (13), we obtain U exp( f η y (X)) exp( hw 1,y(X)) d/2 exp 3(1 α)δ 2 η 2(1 ηM) s 2 . Hence, using the above bound, (12) and (7), we arrive at the following bound on the expected number of rejection iterations P U exp( f η y (X)) exp( hw 1,y(X)) 1 + ηM d/2 exp 3(1 α)δ 2 + η 2(1 ηM) s 2 1 + 2ηM 1 ηM d/2 exp 3(1 α)δ 2 + η s 2 (1 + 4ηM)d/2 exp 3(1 α)δ d/2 exp 3(1 α)δ 2 + 1 exp 3(1 α)δ Note that δ is a tunable parameter. Choosing a large δ makes M small and η large in view of (5) and (12), respectively. Such a choice results in a better complexity of ASF but a larger number of expected rejection steps in RGO. Combining Proposition (3.4) and the convergence results of ASF we obtain the following nonasymptotic complexity bound to sample from non-convex semi-smooth potentials. This complexity bound is better than all existing results when α [0, 1), see Table 1 for comparison. Theorem 3.5 Assume f is Lα-α-semi-smooth and πX exp( f) satisfies PI with constant CPI. With initial distribution ρX 0 and stepsize η 1/(L 2 α+1 α d), Algorithm 1 using Algorithm 2 as an RGO has the Published in Transactions on Machine Learning Research (06/2023) iteration-complexity bound O CPIL 2 α+1 α d log χ2 πX(ρX 0 ) (14) to achieve ε error to the target πX in terms of Chi-squared divergence, and 2 α+1 α qd Rq,πX(ρX 0 ) , q 2 (15) to achieve ε error in Rényi divergence Rq,πX. Each iteration queries O(1) subgradients of f. Proof: The result is a direct consequence of Theorem 2.3, Proposition 3.2 and Proposition 3.4 with the choice of stepsize η 1/(L 2 α+1 α d). Remark: When the coordinate is scaled by γ as x γx, by definition, the semi-smoothness coefficient becomes γα+1Lα and the PI constant becomes CPI/γ2. By Proposition 3.4, to attain O(1) complexity for the RGO, the stepsize should be η [(α+1)δ] 1 α α+1 (γα+1Lα) 2 α+1 d = [(α+1)δ] 1 α α+1 2 α+1 α d . Thus, applying Theorem 2.3, the total complexity remains the same as that without coordinate scaling. 4 Proximal sampling for non-convex composite potentials This section is devoted to the sampling from distributions with non-convex composite potential f satisfying (1). Results presented in this section generalize those in Section 3, which are for the setting with n = 1. Although Section 3 is developed for the simple case where n = 1 in (1), the proof techniques apply to the general case of (1). Hence, to avoid duplication, we present the following results analogous to those in Section 3 without giving proofs. The following lemma is a direct generalization of Lemma 3.1, which shows that functions satisfying (1) can be approximated by smooth functions up to some controllable approximation errors. Lemma 4.1 Assume f satisfies (1), then for any δ > 0, we have |f(u) f(v) f (v), u v | Mn 2 , u, v Rd, (16) [(αi + 1)δ] 1 αi αi+1 . (17) The next proposition is a counterpart of Proposition 3.2 and gives the complexity of solving optimization problem (6) in the context of (1). Proposition 4.2 Assume η 1/(Mnd), and let wn Rd be an approximate stationary point of f η y such that f (wn) + 1 η(wn y) Mnd. Then, the iteration-complexity to find wn with Algorithm 3 is O(1). 
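As an aside, the following minimal sketch (ours; it assumes the reading M_n = Σ_{i=1}^n L_{α_i}^{2/(α_i+1)} / [(α_i+1)δ]^{(1−α_i)/(α_i+1)} of (17)) shows how the composite smoothing constant M_n and the admissible stepsize η ≤ 1/(M_n d) of Proposition 4.2 could be computed for a hypothetical composite potential:

```python
import numpy as np

def smoothing_constant(L, alpha, delta):
    """Composite smoothing constant M_n, under our reading of (17):
    M_n = sum_i L_i^{2/(a_i+1)} / ((a_i+1)*delta)^{(1-a_i)/(a_i+1)}."""
    L, alpha = np.asarray(L, float), np.asarray(alpha, float)
    return float(np.sum(L ** (2 / (alpha + 1))
                        / ((alpha + 1) * delta) ** ((1 - alpha) / (alpha + 1))))

# Hypothetical composite potential: a non-smooth (alpha=0), a weakly smooth
# (alpha=1/2), and a smooth (alpha=1) component; delta is the tunable accuracy.
L, alpha, delta, d = [2.0, 1.0, 5.0], [0.0, 0.5, 1.0], 1.0, 100

Mn = smoothing_constant(L, alpha, delta)
eta = 1.0 / (Mn * d)  # stepsize range used in Propositions 4.2-4.3
print(Mn, eta)
```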
Again, the core to the proximal algorithm (Algorithm 1) is an efficient implementation of RGO. We use Algorithm 2 with slight modification as a realization of RGO in the context of (1). First, in step 1 of Algorithm 2, we use Algorithm 3 to compute wn as in Proposition 4.2 instead of w. Second, in steps 2 and 4 of Algorithm 2, instead of using hw 1,y as in (9), we define hw 1,y as follows, hw 1,y(x) = f(w) + f (w), x w Mn 2 x w 2 + 1 where Mn is as in (17). The next proposition is a generalization of Proposition 3.4. Published in Transactions on Machine Learning Research (06/2023) Proposition 4.3 Assume f satisfies (1) and let f η y be as in (6). Then X generated by Algorithm 2 with modification follows the distribution πX|Y (x | y) exp f η y (x) . Moreover, if η 1 Mnd, then the expected number of rejection steps in Algorithm 2 is at most exp 3Pn In the rest of this section, we present two results on the complexity of sampling for non-convex composite potentials where the target distribution satisfies either LSI or PI. Our complexity bounds are better than all existing results when min αi [0, 1), see Table 1 for comparison. Theorem 4.4 Assume f satisfies (1) and πX exp( f) satisfies LSI with constant CLSI. With initial distribution ρX 0 and stepsize η 1/ Pn i=1 L 2 αi+1 αi d , Algorithm 1 using the modified Algorithm 2 as an RGO has the iteration-complexity bound 2 αi+1 αi d to achieve ε error to πX in KL divergence. Each iteration queries O(1) subgradients of f and generates O(1) samples in expectation from Gaussian distribution.. Moreover, for all q 1, the iteration-complexity bound to achieve ε error to πX in Rényi divergence Rq,πX is 2 αi+1 αi dq Proof: This theorem is a direct consequence of Theorem 2.2 and Propositions 4.2-4.3. Theorem 4.5 Assume f satisfies (1) and πX exp( f) satisfies PI with constant CPI. With initial distribution ρX 0 and stepsize η 1/ Pn i=1 L 2 αi+1 αi d , Algorithm 1 using the modified Algorithm 2 as an RGO has the iteration-complexity bound 2 αi+1 αi d log χ2 πX(ρX 0 ) to achieve ε error to the target πX in terms of Chi-squared divergence, and 2 αi+1 αi qd Rq,πX(ρX 0 ) to achieve ε error in Rényi divergence Rq,πX. Each iteration queries O(1) subgradients of f and generates O(1) samples in expectation from Gaussian distribution. Proof: This theorem immediately follows from Theorem 2.3 and Propositions 4.2-4.3. 5 Proximal sampling for convex composite potentials For completeness, in this section we present a complexity result for sampling from distributions with convex potential f satisfying (1). We only consider the weakly convex setting here as strongly log-concave distributions satisfy LSI and are thus covered by Theorem 4.4. The result is a consequence of Theorem 2.4 for log-concave densities combined with our RGO implementation. In particular, Lemma 4.1 and Propositions 4.2-4.3 apply to this setting. Hence, using Theorem 2.4 and Propositions 4.2-4.3, we directly obtain the following complexity result. Note our result is applicable to composite potentials with any number of components while most existing results are only applicable to composite potentials with at most two components. We also present an alternative result via a regularization approach in Appendix C. Published in Transactions on Machine Learning Research (06/2023) Theorem 5.1 Assume f satisfies (1) and πX exp( f) is log-concave. 
With initial distribution ρX 0 and stepsize η 1/ Pn i=1 L 2 αi+1 αi d , Algorithm 1 using Algorithm 2 as an RGO has the iteration-complexity W 2 2 (ρX 0 , πX) Pn i=1 L 2 αi+1 αi d ε to achieve ε error to the target πX in terms of KL divergence. Each iteration queries O(1) subgradients of f and generates O(1) samples in expectation from standard Gaussian distribution. Proof: This theorem is a direct consequence of Theorem 2.4 and Propositions 4.2-4.3. 6 Computational results In this section, we present a numerical example to illustrate our result. We consider sampling from a Gaussian-Laplace mixture ν(x) = 0.5(2π) d/2p det Q exp( (x 1) Q(x 1)/2) + 0.5(2d) exp( 4x 1) where Q = USU , d = 5, S = diag(14, 15, 16, 17, 18), and U is an arbitrary orthogonal matrix. We run 500, 000 iterations (with 100, 000 burn-in iterations) for both the proximal sampling algorithm and LMC with η = 1/(Md) where d = 5 and M is as in (5) with (α, Lα) = (1, 27) and δ = 1. Histograms and trace plots (of the 3-rd coordinate) of the samples generated by both methods are presented in Figures 1 and 2. The average numbers of optimization iterations and rejection sampling iterations in Algorithm 2 are 1.5 and 1.3, respectively. In addition, we also run 2, 500, 000 iterations (with 500, 000 burn-in iterations) for LMC with η = 1/(5Md). Figure 3 presents the histogram and trace plot for this LMC. The red curves in histograms are the scaled target density ν(x). Figures 4 and 5 show the results for MALA in the same settings as those of Figures 2 and 3, respectively. In these experiments, we observe that our algorithm achieves better accuracy under the same computational budget constraints. Figure 1: Gaussian-Lasso mixture using the proximal sampling algorithm Published in Transactions on Machine Learning Research (06/2023) Figure 2: Gaussian-Lasso mixture using LMC Figure 3: Gaussian-Lasso mixture using LMC with small η Figure 4: Gaussian-Lasso mixture using MALA Figure 5: Gaussian-Lasso mixture using MALA with small η Published in Transactions on Machine Learning Research (06/2023) 7 Conclusions In this paper, we develop a novel proximal sampling algorithm for distributions with semi-smooth potentials and establish complexity-bound results for the proposed method under the assumption that distributions are log-concave or satisfy LSI or PI. Our proximal algorithm is based on the ASF, which resembles the proximal point method in optimization. Each iteration of the ASF generates a sample from a regularized target distribution by querying the RGO, which is itself a challenging algorithmic task due to the lack of smoothness and possibly convexity. The core to our approach is an efficient realization of the RGO based on rejection sampling. We develop a novel technique to bound the expected number of rejection steps in the RGO, which leads to competitive complexity results to sample from semi-smooth and possibly non-convex potentials. Krishna Balasubramanian, Sinho Chewi, Murat A Erdogdu, Adil Salim, and Shunshi Zhang. Towards a theory of non-log-concave sampling: first-order stationarity guarantees for langevin monte carlo. In Conference on Learning Theory, pp. 2896 2923. PMLR, 2022. Espen Bernton. Langevin Monte Carlo and JKO splitting. In Conference On Learning Theory, pp. 1777 1798. PMLR, 2018. Nawaf Bou-Rabee and Martin Hairer. Nonasymptotic mixing of the MALA algorithm. IMA Journal of Numerical Analysis, 33(1):80 110, 2013. Niladri Chatterji, Jelena Diakonikolas, Michael I Jordan, and Peter Bartlett. 
Langevin Monte Carlo without smoothness. In International Conference on Artificial Intelligence and Statistics, pp. 1716 1726. PMLR, 2020. Yongxin Chen, Sinho Chewi, Adil Salim, and Andre Wibisono. Improved analysis for a proximal algorithm for sampling. In Conference on Learning Theory, pp. 2984 3014. PMLR, 2022. Sinho Chewi, Murat A Erdogdu, Mufan Bill Li, Ruoqi Shen, and Matthew Zhang. Analysis of Langevin Monte Carlo from Poincaré to Log-Sobolev. ar Xiv preprint ar Xiv:2112.12662, 2021. James B Clarage, Tod Romo, B Kim Andrews, B Montgomery Pettitt, and George N Phillips. A sampling problem in molecular dynamics simulations of macromolecules. Proceedings of the National Academy of Sciences, 92(8):3288 3292, 1995. Arnak S Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651 676, 2017. Arnak S Dalalyan and Lionel Riou-Durand. On sampling from a log-concave density using kinetic Langevin diffusions. Bernoulli, 26(3):1956 1988, 2020. Arnak S Dalalyan, Avetik Karagulyan, and Lionel Riou-Durand. Bounding the error of discretized langevin algorithms for non-strongly log-concave targets. Journal of Machine Learning Research, 23(235):1 38, 2022. Olivier Devolder, François Glineur, and Yurii Nesterov. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1):37 75, 2014. Alain Durmus, Eric Moulines, and Marcelo Pereyra. Efficient Bayesian computation by proximal Markov Chain Monte Carlo: when Langevin meets Moreau. SIAM Journal on Imaging Sciences, 11(1):473 506, 2018. Alain Durmus, Szymon Majewski, and Błażej Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. The Journal of Machine Learning Research, 20(1):2666 2711, 2019. Published in Transactions on Machine Learning Research (06/2023) Murat A Erdogdu and Rasa Hosseinzadeh. On the convergence of Langevin Monte Carlo: the interplay between tail growth and smoothness. In Conference on Learning Theory, pp. 1776 1822. PMLR, 2021. Yoav Freund, Yi-An Ma, and Tong Zhang. When is the convergence time of langevin algorithms dimension independent? a composite optimization viewpoint. Journal of Machine Learning Research, 23(214):1 32, 2022. Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B Rubin. Bayesian Data Analysis. CRC press, 2013. Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on pattern analysis and machine intelligence, (6):721 741, 1984. Ulf Grenander and Michael I Miller. Representations of knowledge in complex systems. Journal of the Royal Statistical Society: Series B (Methodological), 56(4):549 581, 1994. Pavel Izmailov, Sharad Vikram, Matthew D Hoffman, and Andrew Gordon Gordon Wilson. What are bayesian neural network posteriors really like? In International Conference on Machine Learning, pp. 4629 4640. PMLR, 2021. Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009. Igor Kononenko. Bayesian neural networks. Biological Cybernetics, 61(5):361 370, 1989. Werner Krauth. Statistical mechanics: algorithms and computations, volume 13. OUP Oxford, 2006. Yin Tat Lee, Ruoqi Shen, and Kevin Tian. Structured logconcave sampling with a restricted Gaussian oracle. In Conference on Learning Theory, pp. 2993 3050. PMLR, 2021. Joseph Lehec. 
The langevin monte carlo algorithm in the non-smooth log-concave case. Available on ar Xiv:2101.10695, 2021. Jiaming Liang and Yongxin Chen. A proximal algorithm for sampling from non-smooth potentials. In 2022 Winter Simulation Conference (WSC), pp. 3229 3240, 2022. Tung Duy Luu, Jalal Fadili, and Christophe Chesneau. Sampling from non-smooth distributions through Langevin diffusion. Methodology and Computing in Applied Probability, 23(4):1173 1201, 2021. Tatiana Maximova, Ryan Moffatt, Buyong Ma, Ruth Nussinov, and Amarda Shehu. Principles and overview of sampling methods for modeling macromolecular structure and dynamics. PLo S computational biology, 12(4):e1004619, 2016. Wenlong Mou, Nicolas Flammarion, Martin J Wainwright, and Peter L Bartlett. An efficient sampling algorithm for non-smooth composite potentials. Journal of Machine Learning Research, 23(233):1 50, 2022a. Wenlong Mou, Nicolas Flammarion, Martin J Wainwright, and Peter L Bartlett. Improved bounds for discretization of langevin diffusions: Near-optimal rates without convexity. Bernoulli, 28(3):1577 1601, 2022b. Radford M Neal. MCMC using hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011. Yu Nesterov. Universal gradient methods for convex optimization problems. Mathematical Programming, 152(1):381 404, 2015. Dao Nguyen, Xin Dang, and Yixin Chen. Unadjusted Langevin algorithm for non-convex weakly smooth potentials. ar Xiv preprint ar Xiv:2101.06369, 2021. Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in optimization, 1(3):127 239, 2014. Published in Transactions on Machine Learning Research (06/2023) Giorgio Parisi. Correlation functions and computer simulations. Nuclear Physics B, 180(3):378 384, 1981. Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory, pp. 1674 1703. PMLR, 2017. Alfréd Rényi. On measures of entropy and information. In Proceedings of the fourth Berkeley symposium on mathematical statistics and probability, volume 1. Berkeley, California, USA, 1961. Gareth O Roberts and Osnat Stramer. Langevin diffusions and Metropolis-Hastings algorithms. Methodology and Computing in Applied Probability, 4(4):337 357, 2002. Gareth O Roberts and Richard L Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pp. 341 363, 1996. Adil Salim and Peter Richtárik. Primal dual interpretation of the proximal stochastic gradient Langevin algorithm. Advances in Neural Information Processing Systems, 33:3786 3796, 2020. Ruoqi Shen, Kevin Tian, and Yin Tat Lee. Composite logconcave sampling with a Restricted Gaussian Oracle. Available on ar Xiv:2006.05976, 2020. Jack W Sites Jr and Jonathon C Marshall. Delimiting species: a renaissance issue in systematic biology. Trends in Ecology & Evolution, 18(9):462 470, 2003. Santosh Vempala and Andre Wibisono. Rapid convergence of the unadjusted langevin algorithm: Isoperimetry suffices. Advances in neural information processing systems, 32, 2019. Cédric Villani. Topics in optimal transportation, volume 58. American Mathematical Soc., 2021. Andre Wibisono. Sampling as optimization in the space of measures: The langevin dynamics as a composite optimization problem. In Conference on Learning Theory, pp. 2093 3027. PMLR, 2018. Andre Wibisono. Proximal langevin algorithm: Rapid convergence under isoperimetry. ar Xiv preprint ar Xiv:1911.01469, 2019. 
Published in Transactions on Machine Learning Research (06/2023) 1 Introduction 1 2 Problem formulation and Background 3 3 Proximal sampling for non-convex and semi-smooth potentials 5 4 Proximal sampling for non-convex composite potentials 8 5 Proximal sampling for convex composite potentials 9 6 Computational results 10 7 Conclusions 12 A Technical results 16 B Missing proofs 18 B.1 Proof of Lemma 3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 B.2 Proof of Proposition 3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 C Sampling from regularized convex semi-smooth potentials 19 D Solving the optimization problem 21 E Approximate implementations of ASF 24 Published in Transactions on Machine Learning Research (06/2023) A Technical results This section collects the definition of Frechet subdifferential and a few technical results that are useful for the analysis of RGO in Section 3. Definition A.1 (Frechet subdifferential) Let f : Rn R { } be a proper closed function, then the Frechet subdifferential is defined as f(x) = v Rn : lim inf y x f(y) f(x) v, y x The first result is the well-known Gaussian integral. Lemma A.2 A useful Gaussian integral: for any η > 0, Z 2η x 2 dx = (2πη)d/2. The following lemma shows that f η y as in (6) is close to a strongly convex and smooth function when η is small. This result is used in Proposition 3.2. Lemma A.3 Let f η y := f + 1 2η y 2 and (f η y ) := f + 1 η( y), then we have for every u, v Rd, η M u v 2 (1 α)δ 2 f η y (u) f η y (v) (f η y ) (v), u v η + M u v 2 + (1 α)δ Proof: Using the definitions of f η y and (f η y ) , we have f η y (u) f η y (v) (f η y ) (v), u v =f(u) f(v) f (v), u v + 1 =f(u) f(v) f (v), u v + 1 The lemma now follows from above identity and (4). The next lemma gives equivalent formulas of R exp( hw 1,y(x))dx and R exp( hw 2,y(x))dx that are useful in Proposition 3.4. Lemma A.4 Recall w and w are defined in Proposition 3.2 and Lemma 3.3, respectively. Let hw 1,y and hw 2,y be as in (9) and (10), respectively. Then, we have the following integrals Z exp( hw 1,y(x))dx = 2πη 1 ηM d/2 exp (H1(w)) , (21) Z exp( hw 2,y(x))dx = 2πη 1 + ηM d/2 exp (H2(w )) , (22) 2η w 2 + η 2(1 ηM) s 2 f(w) + 1 2η y 2 + 1 α 2η w 2 f(w ) + f (w ), w 1 Published in Transactions on Machine Learning Research (06/2023) Proof: We first rewrite hw 1,y and hw 2,y as follows hw 1,y(x) =f(w) + f (w), x w M 2 x w 2 + 1 2η x y 2 (1 α)δ x + ηM 1 ηM w 1 1 ηM y + η 1 ηM f (w) 1 2η(1 ηM) ηMw y + ηf (w) 2 + f(w) f (w), w M hw 2,y(x) =f(w ) + f (w ), x w + M 2 x w 2 + 1 2η x y 2 + (1 α)δ x ηM 1 + ηM w 1 1 + ηM y + η 1 + ηM f (w ) 1 2η(1 + ηM) ηMw + y ηf (w ) 2 + f(w ) f (w ), w + M 2η y 2 + 1 α It follows from (25) and Lemma A.2 that Z exp( hw 1,y(x))dx = 2πη 1 ηM d/2 exp( ˆH1(w)), (27) Z exp( hw 2,y(x))dx = 2πη 1 + ηM d/2 exp( ˆH2(w )), (28) ˆH1(w) = 1 2η(1 ηM) ηMw y + ηf (w) 2 f(w) + f (w), w 2η y 2 + 1 α ˆH2(w ) = 1 2η(1 + ηM) ηMw + y ηf (w ) 2 f(w ) + f (w ), w It suffices to show that ˆH1(w) = H1(w) and ˆH2(w ) = H2(w ) to complete the proof. Published in Transactions on Machine Learning Research (06/2023) We first verify that ˆH1(w) = H1(w). 
Using the definition of ˆH1(w) above and the definition of s in (7), we have ˆH1(w) = 1 2η(1 ηM) ηMw w + w y + ηf (w) 2 f(w) + f (w), w 2η y 2 + 1 α = 1 2η(1 ηM) (ηM 1)w + ηs 2 f(w) + f (w), w 2η y 2 + 1 α 2 w 2 w, s + η 2(1 ηM) s 2 f(w) + f (w), w 2η y 2 + 1 α 2η w 2 + η 2(1 ηM) s 2 f(w) + 1 2η y 2 + 1 α In view of (23), we verify that ˆH1(w) = H1(w) and hence (21) is proved. We next verify that ˆH2(w ) = H2(w ). Using the definition of ˆH2(w ) and (8), we have ˆH2(w ) = 1 2η(1 + ηM) ηMw + w w + y ηf (w ) 2 f(w ) + f (w ), w = 1 2η(1 + ηM) (ηM + 1)w 2 M 2 w 2 f(w ) + f (w ), w 2η w 2 f(w ) + f (w ), w 1 In view of (24), we verify that ˆH2(w ) = H2(w ) and hence (22) is proved. B Missing proofs B.1 Proof of Lemma 3.1 It follows from the assumption that f is Lα-semi-smooth that for every u, v Rd, |f(u) f(v) f (v), u v | Lα α + 1 u v α+1. (29) Using the Young s inequality ab ap/p + bq/q with a = Lα (α + 1)δ 1 α 2 u v α+1, b = δ 1 α 2 , p = 2 α + 1, q = 2 1 α, we obtain Lα α + 1 u v α+1 L 2[(α + 1)δ] 1 α α+1 u v 2 + (1 α)δ Plugging the above inequality into (29), we have |f(u) f(v) f (v), u v | L 2[(α + 1)δ] 1 α α+1 u v 2 + (1 α)δ Published in Transactions on Machine Learning Research (06/2023) This inequality and the definition of M in (5) imply (4). B.2 Proof of Proposition 3.2 It follows from Lemma A.3 that f η y satisfies (38) with η + M, θ = (1 α)δ Since η 1/(Md), it is easy to verify that the assumption on ρ (i.e., Md in our case) in Proposition D.4 is satisfied. Hence, it follows from Proposition D.4 and (30) that the proposition holds. C Sampling from regularized convex semi-smooth potentials This section presents an alternative approach via regularization for sampling in the same setting as in Section 5, i.e., f is convex and satisfies (1). More specifically, the alternative complexity result is obtained by first applying Theorem 2.2 on a regularized convex semi-smooth potential and then specifying the regularization parameter. We consider the following regularized potential with some µ > 0, ˆf( ) = f( ) + µ 2 x0 2, (31) which is clearly µ-strongly convex by construction. Hence, exp( ˆf) satisfies µ-LSI and Theorem 2.2 is applicable. Since ˆf is µ-strongly convex, improved versions of results in Section 4 can be obtained. We omit the proofs since they can be easily obtained by following similar ideas in the proofs given in Section 3. Lemma C.1 Assume f is a convex function and satisfies (1), then for δ > 0 and every u, v Rd, we have ˆf(u) ˆf(v) ˆf (v), u v Mn + µ ˆf(u) ˆf(v) ˆf (v), u v µ where ˆf and Mn are as in (31) and (17), respectively. Proposition C.2 Assume η 1 Mnd, and let w Rd be an approximate stationary point of ˆf η y (x) := ˆf(x) + 1 2η x y 2 , (32) Mnd, s = ˆf (w) + 1 η (w y), (33) where Mn is as in (17). Then, the iteration-complexity to find w by using Algorithm 3 is O(1). Lemma C.3 Let w Rd be a stationary point of ˆf η y , i.e., ˆf (w ) + 1 η (w y) = 0. ˆhw 1,y(x) := ˆf(w) + ˆf (w), x w + µ 2 x w 2 + 1 ˆhw 2,y(x) := ˆf(w ) + ˆf (w ), x w + Mn + µ 2 x w 2 + 1 Published in Transactions on Machine Learning Research (06/2023) where w is as in (33). Then, we have for every x Rd, ˆhw 1,y(x) ˆf η y (x) ˆhw 2,y(x). Proposition C.4 Assume f is convex and satisfies (1) and let ˆf η y be as in (32). Then X generated by Algorithm 2 (with (7), f η y , and hw 1,y(x) replaced by (33), ˆf η y , and ˆhw 1,y(x), respectively) follows the distribution πX|Y (x | y) exp ˆf η y (x) . 
Moreover, if η 1 Mnd, then the expected number of rejection steps in Algorithm 2 is at most exp Pn i=1 (1 αi)δ Proposition C.5 Assume f is convex and satisfies (1), and let ˆf be as in (31). With initial distribution ρX 0 and stepsize η 1/ Pn i=1 L 2 αi+1 αi d , Algorithm 1 using Algorithm 2 as an RGO achieves ε error in terms of KL divergence with respect to exp( ˆf) in 2 αi+1 αi d µ iteration, and each iteration queries O(1) subgradient oracle of f and O(1) Gaussian distribution sampling oracle. Proof: By construction, ˆf is µ-strongly convex and exp( ˆf) satisfies µ-LSI. Using Theorem 2.2, starting from initial distribution ρX 0 , it takes 1 2µη log HπX(ρX 0 ) ε (35) iterations for Algorithm 1 to achieve ε error to exp( ˆf) with respect to KL divergence. By Propositions C.2 and C.4, Algorithm 2 queries O(1) subgradient oracle of f and O(1) Gaussian distribution sampling oracle. Complexity (34) then follows by plugging η into (35). Building upon Proposition C.5 for sampling from exp( ˆf(x)) = exp( f(x) µ x x0 2/2), we establish a complexity result, in the following theorem, to sample from the original target distribution πX exp( f) by choosing a proper regularization constant µ. Theorem C.6 Let πX exp( f(x)) be the target distribution where f is convex and satisfies (1). Let x0 Rd and ε > 0 be given and set 2 M4 + x0 xmin 2 (36) where M4 = R Rd x xmin 4dπX(x) and xmin Argmin {f(x) : x Rd}. With initial distribution ρX 0 and stepsize η 1/ Pn i=1 L 2 αi+1 αi d , Algorithm 1 using Algorithm 2 as an RGO for step 1, applied to ν exp( ˆf) = exp( f µ x0 2/2) has the iteration-complexity bound 2 αi+1 αi d M4 + x0 xmin 2 to achieve ε error to πX in terms of total variation. Proof: Let ρX denote the distribution of the samples generated by Algorithm 1 using Algorithm 2 as an RGO. Following the proof of Corollary 4.1 of Chatterji et al. (2020), we obtain ρX πX TV ρX ν TV + ν πX TV Published in Transactions on Machine Learning Research (06/2023) Rd[f(x) ˆf(x)]2dπX(x) 1/2 = 1 2 x x0 2 2 dπX(x) 1/2 x xmin 2 + xmin x0 2 2 dπX(x) 1/2 2 x xmin 4 + 2 xmin x0 4 dπX(x) 1/2 2µ 2 M4 + xmin x0 4 1/2 M4 + x0 xmin 2 = ε 2 where the last identity is due to the definition of µ in (36). Hence, it suffices to derive the iteration-complexity bound for Algorithm 1 to achieve ρX πX TV ε/2, which is (37) in view of Proposition C.5 with µ as in (36). Note that even though the complexity result in Proposition C.5 is with respect to KL divergence, one can get the same order of complexity with respect to total variation using Pinsker inequality. D Solving the optimization problem In this section, we use Nesterov s acceleration to establish the iteration-complexity for solving a general optimization problem that is nearly strongly convex and nearly smooth. This general result is then applied in Section 3 to find an approximate stationary point of f η y (see Proposition 3.2). We consider the optimization problem min{g(x) : x Rd}, where g satisfies 2 u v 2 θ g(u) g(v) g (v), u v L 2 u v 2 + θ, u, v Rd, (38) for some given θ > 0, µ 0, and L 0. We use Nesterov s accelerated gradient method to find a ρapproximate stationary point w such that g (w) ρ. We also establish the iteration-complexity to obtain a ρ-approximate stationary point. Algorithm 3 Accelerated Gradient Method 0. Let an initial point x0, parameters L, µ > 0 be given, and set y0 = x0, A0 = 0, τ0 = 1, and k = 0; 1. 
Compute ak = τk + p τ 2 k + 4τk LAk 2L , Ak+1 = Ak + ak, (39) τk+1 = τk + akµ, xk = Akyk + akxk Ak+1 ; (40) yk+1 := argmin u Rd 2 u xk 2 , (41) xk+1 := argmin u Rd n akγk(u) + τk 2 u xk 2o , (42) where γk(u) := g( xk) + g ( xk), u xk + µ 2 u xk 2; (43) 3. Set k k + 1 and go to step 1. Lemma D.1 For every k 0 and u Rd, we have θ g(u) γk(u) L µ 2 u xk 2 + θ. (44) Published in Transactions on Machine Learning Research (06/2023) Proof: This lemma directly follows from (38) and the definition of γk in (43). Lemma D.2 For every k 0, the following statements hold: a) Ak+1 = La2 k/τk; c) Pk i=0 Ai+1 exp(2(k+1)(β 1)/β) 1 L(β2 1) where β = 1 + µ 2 Proof: a) This statement directly follows from (39). b) It is easy to see from (39), A0 = 0 and τ0 = 1 that A1 = 1/L. Using the definitions of Ak and τk in (39) and (40), respectively, and the facts that A0 = 0 and τ0 = 1, we easily derive that τk = τ0 + Akµ = 1 + Akµ. (45) It follows from (39) that Ak+1 = Ak + ak = Ak + τk + p τ 2 k + 4τk LAk The above inequality and (45) imply Ak + 1 + Akµ This statement now follows from the above relation and the fact that A1 = 1/L. c) Noting from b) that Ak+1 β2k/L, which together with the fact x exp((x 1)/x) for x 1, implies that k X i=0 β2i = β2(k+1) 1 L(β2 1) exp(2(k + 1)(β 1)/β) 1 Lemma D.3 For every k 0, define tk(u) = Ak [g(yk) g(u)] + τk 2 u xk 2, (46) then for every u Rd, we have 2 Ak+1 yk+1 xk 2 tk(u) tk+1(u) + 2Ak+1θ. (47) Proof: Using the fact γk is convex and the definition of xk in (40), we have Akγk(yk) + akγk(u) + τk + τk A2 k+1 2a2 k Ak+1 min γk(u) + L γk(yk+1) + L 2 yk+1 xk 2 , Published in Transactions on Machine Learning Research (06/2023) where the first identity is due to (39) and the second identity is due to the definition of yk+1 in (41). It follows from the second inequality of (44) with u = yk+1 and the above inequality that Ak+1 h g(yk+1) θ + µ 2 yk+1 xk 2i γk(yk+1) + L 2 yk+1 xk 2 Akγk(yk) + akγk(xk+1) + τk 2 xk+1 xk 2 Akγk(yk) + akγk(u) + τk 2 u xk 2 τk+1 where the last inequality is due to (42) and the fact that akγk + τk xk 2/2 is τk+1-strongly convex. Rearranging the terms in the above inequality, we obtain µ 2 Ak+1 yk+1 xk 2 Akγk(yk) + akγk(u) + τk 2 u xk 2 τk+1 2 u xk+1 2 Ak+1 [g(yk+1) θ] =Ak [g(yk) g(u)] + τk 2 u xk 2 Ak+1 [g(yk+1) g(u)] τk+1 + Ak [γk(yk) g(yk)] + ak [γk(u) g(u)] + Ak+1θ tk(u) tk+1(u) + 2Ak+1θ where the identity is due to the fact that Ak+1 = Ak + ak, and the last inequality is due to (46) and the first inequality of (44). Proposition D.4 If ρ 2 θ/ µ, then the number of iterations k0 to obtain a ρ-approximate stationary point of g is at most L + µ 2 µ log (µ + L)2d2 0 ρ2 2 L + µ 2 µ + 1 Proof: It follows from the optimality condition of (41) that g ( xk) = (µ + L)( xk yk+1). Using the above relation and summing (47) with u = x from k = 0 to k 1, we have µ 2(µ + L)2 i=0 Ai+1 g ( xi) 2 = µ i=0 Ai+1 yi+1 xi 2 i=0 Ai+1θ = d2 0 2 + 2 where the last identity follows from the facts that A0 = 0 and τ0 = 1. The above inequality and the assumption on ρ imply that min 0 i k 1 g ( xi) 2 (µ + L)2 d2 0 Pk 1 i=0 Ai+1 + 4θ µ d2 0 Pk 1 i=0 Ai+1 + ρ2 In order to show min0 i k 1 g ( xi) ρ, it suffices to show µ d2 0 Pk 1 i=0 Ai+1 ρ2 Using Lemma D.2 (c) and the fact that k k0 where k0 is as in (48), we have i=0 Ai+1 exp(2k(β 1)/β) 1 L(β2 1) 2(µ + L)2d2 0 µρ2 , and hence (49) is proved. Published in Transactions on Machine Learning Research (06/2023) E Approximate implementations of ASF The implementation of RGO in Algorithm 2 is exact, hence the samples of both RGO and ASF are unbiased. 
In contrast, it is shown in Theorem 1 of Wibisono (2019) (resp., Theorem 2 of Vempala & Wibisono (2019)) that the proximal Langevin algorithm (PLA) (resp., LMC) is biased. From the perspective of proximal sampling, we give an explanation of this fact by showing that PLA and its equivalent method, namely, proximal Langevin Monte Carlo (PLMC) of Bernton (2018), can be viewed as an instance of ASF whose implementation of RGO is inexact. Assume the potential function f is convex and smooth. Recall that PLMC iteratively generates samples as follows: given yk Rd, then the next sample yk+1 takes the form of yk+1 = proxηf(yk) + 2ηz where z N(0, I). It is easy to verify that the following algorithm gives an equivalent form of PLMC. Algorithm 4 Proximal Langevin Monte Carlo (Bernton, 2018) 1. Sample xk exp[ 1 2η x proxηf(yk) 2] 2. Sample yk+1 πY |X(y | xk) exp[ 1 Next, we show that PLMC is an approximate implementation of ASF. Similar to ASF, PLMC also alternates between steps 1 and 2. More specifically, step 2 of PLMC plays the same role as step 1 of ASF, and step 1 of PLMC can be viewed as sampling from the proposal density exp[ h1(x)] without rejection, where h1(x) := f η yk(proxηf(yk)) + 1 2η x proxηf(yk) 2 f η yk(x). Since f η yk(x) is the potential function of the RGO in step 2 of ASF. Hence, step 1 of PLMC is an approximate implementation of the RGO. As a result, both PLMC and PLA are approximate implementations of ASF, and thus they are biased. Following a similar argument, we show that LMC can be viewed as an instance of ASF whose implementation of RGO is inexact. Assume f in the target distribution π exp( f) is convex and smooth and recall that the iterative step in LMC can be described as yk+1 = yk η f(yk) + p 2ηz, z N(0, I). (50) We claim that the following algorithm gives an equivalent form of LMC (50) from the proximal sampling perspective. Algorithm 5 Langevin Monte Carlo 1. Sample yk πY |X(y | xk) exp[ 1 2η xk y 2] 2. Sample xk+1 exp[ 1 2η x yk + η f(yk) 2] Indeed, steps 1 and 2 can be equivalently written as xk+1 = yk η f(yk) + ηzk, zk N(0, I), yk+1 = xk+1 + ηz k, z k N(0, I), where yk+1 is the sample from step 1 in the next iteration. Combining the above identities, we have yk+1 = yk η f(yk) + η(zk + z k) d= yk η f(yk) + p 2ηz, z N(0, I). Moreover, LMC and ASF share the same step 1, and step 2 of LMC equivalently generates xk+1 from exp[ h1(x)] where h1(x) := f(yk) + f(yk), x yk + 1 2η x yk 2. (51) Using the definition of h1 in (51) and the convexity of f, we have h1(x) f(x) + 1 2η x yk 2 = f η yk(x). Published in Transactions on Machine Learning Research (06/2023) Note that f η yk(x) is the potential function of the RGO in step 2 of ASF. Hence, step 2 of LMC can be interpreted as an RGO implementation with the proposal density exp[ h1(x)] but without rejection. As a result, LMC is an approximate implementation of ASF and thus LMC is biased. It is worth noting that many other sampling algorithms, for example, symmetric Langevin algorithm of Wibisono (2018) and Metropolis-adjusted proximal gradient Langevin dynamics of Mou et al. (2022a), can also be shown to be approximate implementations of ASF in an analogous way.
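To illustrate the equivalence discussed above, here is a minimal sketch (ours, for a toy standard-Gaussian target) of the LMC update (50) written both in its usual form and as the two half-steps of Algorithm 5; the two produce identically distributed iterates:

```python
import numpy as np

def lmc_step(y, grad_f, eta, rng):
    # Usual LMC update (50): y_{k+1} = y_k - eta * grad f(y_k) + sqrt(2*eta) * z.
    return y - eta * grad_f(y) + np.sqrt(2 * eta) * rng.standard_normal(y.shape)

def lmc_step_via_asf(y, grad_f, eta, rng):
    # The same update written as the two half-steps of Algorithm 5:
    # an inexact RGO step followed by the Gaussian step of the ASF.
    x = y - eta * grad_f(y) + np.sqrt(eta) * rng.standard_normal(y.shape)  # step 2 (inexact RGO)
    return x + np.sqrt(eta) * rng.standard_normal(y.shape)                 # step 1 of the next iteration

rng = np.random.default_rng(0)
grad_f = lambda x: x          # f(x) = ||x||^2 / 2, i.e., a standard Gaussian target
y = np.zeros(2)
for _ in range(1_000):
    y = lmc_step_via_asf(y, grad_f, eta=0.1, rng=rng)
print(y)                       # approximately a draw from N(0, I), up to the LMC bias
```

The sum of the two independent N(0, ηI) perturbations has the same law as the single √(2η)·N(0, I) term in (50), which is exactly the identity used in the equivalence argument; the bias of LMC then stems from the proposal exp(−h_1) replacing the exact RGO density without a rejection step.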