# A Proximal Algorithm for Sampling

Published in Transactions on Machine Learning Research (06/2023)

Jiaming Liang (jiaming.liang@yale.edu), Department of Computer Science, Yale University, New Haven, CT 06511.
Yongxin Chen (yongchen@gatech.edu), School of Aerospace Engineering, Georgia Institute of Technology, Atlanta, GA 30332.

Reviewed on OpenReview: https://openreview.net/forum?id=CkXOwlhf27

Abstract: We study sampling problems associated with potentials that may lack smoothness or convexity. Departing from the standard smooth setting, the potentials are only assumed to be weakly smooth or non-smooth, or the summation of multiple such functions. We develop a sampling algorithm that resembles proximal algorithms in optimization for this challenging sampling task. Our algorithm is based on a special case of Gibbs sampling known as the alternating sampling framework (ASF). The key contribution of this work is a practical realization of the ASF based on rejection sampling for both non-convex and convex potentials that are not necessarily smooth. In almost all the cases of sampling considered in this work, our proximal sampling algorithm achieves a better complexity than all existing methods.

1 Introduction

The problem of drawing samples from an unnormalized probability distribution plays an essential role in data science and scientific computing (Durmus et al., 2018; Clarage et al., 1995; Maximova et al., 2016). It has been widely used in many areas such as Bayesian inference, Bayesian neural networks, probabilistic graphical models, biology, and machine learning (Gelman et al., 2013; Kononenko, 1989; Koller & Friedman, 2009; Krauth, 2006; Sites Jr & Marshall, 2003; Durmus et al., 2018). Compared with optimization-oriented methods, sampling has the advantage of quantifying the uncertainty and confidence level of the solution, and it often provides more reliable solutions to engineering problems. This advantage, however, comes at a higher computational cost. It is thus important to develop more efficient sampling algorithms.

In the classical setting of sampling, the potential function f of an unnormalized target distribution exp(−f(x)) is assumed to be smooth and (strongly) convex. Over the past decades, many sampling algorithms have been developed, including Langevin Monte Carlo (LMC), kinetic Langevin Monte Carlo (KLMC), Hamiltonian Monte Carlo (HMC), the Metropolis-adjusted Langevin algorithm (MALA), etc. (Dalalyan, 2017; Grenander & Miller, 1994; Parisi, 1981; Roberts & Tweedie, 1996; Dalalyan & Riou-Durand, 2020; Bou-Rabee & Hairer, 2013; Roberts & Stramer, 2002; Neal, 2011). Many of these algorithms are based on some type of discretization of the Langevin diffusion or the underdamped Langevin diffusion, and they resemble gradient-based algorithms in optimization. These algorithms work well for (strongly) convex and smooth potentials, and many non-asymptotic complexity bounds have been proven. However, the cases where either the convexity or the smoothness is lacking are much less understood (Chewi et al., 2021; Chen et al., 2022; Chatterji et al., 2020; Erdogdu & Hosseinzadeh, 2021; Liang & Chen, 2022; Dalalyan et al., 2022; Balasubramanian et al., 2022; Mou et al., 2022a). In this paper, we consider the challenging task of sampling from potentials that are not smooth and even not convex. Many sampling problems in real applications fall into this setting.
For instance, Bayesian neural networks are highly non-convex models corresponding to probability densities with multi-modality (Izmailov et al., 2021), and the lack of smoothness is due to the use of activation functions such as ReLU. The goal of this work is to develop an efficient algorithm with provable guarantees for a class of potentials that are not smooth. In particular, we consider potential functions that are semi-smooth, defined by (1).

Inspired by the recent line of research that lies at the interface of sampling and optimization, we examine this sampling task from an optimization perspective. We build on the intuition of proximal algorithms for non-smooth optimization problems and develop a proximal algorithm to sample from non-convex and semi-smooth potentials. Our algorithm is based on the alternating sampling framework (ASF) (Lee et al., 2021) developed recently to sample from strongly convex potentials. In a nutshell, the ASF is a Gibbs sampler over a carefully designed augmented distribution of the target, and one can thus sample from the target distribution by sampling from the augmented distribution. The convergence results of the ASF have recently been improved (Chen et al., 2022) to cover non-log-concave distributions that satisfy functional inequalities such as the logarithmic Sobolev inequality (LSI) and the Poincaré inequality (PI).

The ASF is an idealized algorithm that is not directly implementable. In each iteration, it needs to query the so-called restricted Gaussian oracle (RGO), which is itself a sampling task from a quadratically regularized distribution exp(−f(x) − ‖x − y‖²/(2η)) for some given η > 0 and y ∈ R^d. The RGO can be viewed as a sampling counterpart of the proximal map in optimization. The total complexity of the ASF thus depends on that of the RGO. Except for a few special cases where f has certain structures, the RGO is usually a challenging task. One key contribution of this work is a practical and efficient algorithm for the RGO for potentials that are neither smooth nor convex. This algorithm extends the recent work of Liang & Chen (2022) for convex and non-smooth potentials. Combining the ASF and our algorithm for the RGO, we establish a proximal algorithm for sampling from (convex or non-convex) semi-smooth potentials. Note that the techniques used in Liang & Chen (2022) are no longer applicable to these non-convex semi-smooth settings; in this work, we develop a new algorithm and the associated proof techniques for an efficient RGO implementation.

Our contributions are summarized as follows. i) We develop an efficient sampling scheme for the RGO of semi-smooth potentials, which can be either convex or non-convex, and bound its complexity with a novel technique. ii) We combine our RGO scheme and the ASF to develop a sampling algorithm that can sample from both convex and non-convex semi-smooth potentials; our algorithm has a better non-asymptotic complexity than almost all existing methods in the same setting. iii) We further extend our algorithms to potentials that are the summation of multiple semi-smooth functions.

Table 1: Complexity bounds for sampling from non-convex potentials.

| Source | Complexity | Assumption | Metric |
| --- | --- | --- | --- |
| Chewi et al. (2021) | $\mathcal{O}\big(C_{\mathrm{PI}}^{1+1/\alpha} L_{\alpha}^{2/\alpha} d^{2+1/\alpha}\big)$ | weakly smooth, PI, $\alpha > 0$ | Rényi |
| this paper (Thm. 3.5) | $\mathcal{O}\big(C_{\mathrm{PI}} L_{\alpha}^{2/(1+\alpha)} d^{2}\big)$ | semi-smooth, PI | Rényi |
| Nguyen et al. (2021) | $\mathcal{O}\big(C_{\mathrm{LSI}}^{1+\max_i\{1/\alpha_i\}}\, n \max_i\{L_{\alpha_i}^{2}\}\, d/\varepsilon\big)$ | $\alpha_i > 0$, composite semi-smooth, LSI | KL |
| this paper (Thm. 4.4) | $\mathcal{O}\big(C_{\mathrm{LSI}} \sum_{i=1}^{n} L_{\alpha_i}^{2/(\alpha_i+1)} d\big)$ | composite semi-smooth, LSI | KL |
| this paper (Thm. 4.5) | $\mathcal{O}\big(C_{\mathrm{PI}} \sum_{i=1}^{n} L_{\alpha_i}^{2/(\alpha_i+1)} d\big)$ | composite semi-smooth, PI | Rényi |
Related works: MCMC sampling from non-convex potentials has been investigated in Raginsky et al. (2017); Vempala & Wibisono (2019); Wibisono (2019); Chewi et al. (2021); Erdogdu & Hosseinzadeh (2021); Mou et al. (2022b); Luu et al. (2021). There have also been some works on sampling without smoothness: Lehec (2021); Durmus et al. (2019); Chatterji et al. (2020); Chewi et al. (2021); Mou et al. (2022a); Shen et al. (2020); Durmus et al. (2018); Freund et al. (2022); Salim & Richtárik (2020); Bernton (2018); Nguyen et al. (2021); Liang & Chen (2022). The literature for the case where the potential function lacks both convexity and smoothness is rather scarce. In Nguyen et al. (2021); Erdogdu & Hosseinzadeh (2021), the authors analyze the convergence of LMC for weakly smooth potentials that satisfy a dissipativity condition; the target distribution is assumed to satisfy some functional inequality, and Erdogdu & Hosseinzadeh (2021) also assumes an additional tail growth condition. The dissipativity condition is removed in Chewi et al. (2021). The results in Nguyen et al. (2021) are applicable to potentials that are the summation of multiple weakly smooth functions, while those in Chewi et al. (2021); Erdogdu & Hosseinzadeh (2021) are not. To compare our results with them, we make the simplification that the initial distance, either in KL or Rényi divergence, to the target distribution is O(d). The results in cases with non-convex potentials are presented in Table 1. It can be seen that our complexities have better dependence on all the parameters: the LSI constant C_LSI, the PI constant C_PI, the weak-smoothness coefficients L_α, and the dimension d. Moreover, our complexity bounds depend polylogarithmically on the accuracy ε (thus ε does not appear in the O notation), while all the other results have polynomial dependence on 1/ε.

2 Problem formulation and Background

We are interested in sampling from distributions with potentials that are not necessarily smooth. More specifically, we consider the sampling task with the target distribution ν ∝ exp(−f(x)), where the potential f satisfies

$$\|f'(u) - f'(v)\| \;\le\; \sum_{i=1}^{n} L_{\alpha_i} \|u - v\|^{\alpha_i}, \qquad \forall\, u, v \in \mathbb{R}^d \qquad (1)$$

for some α_i ∈ [0, 1] and L_{α_i} > 0, 1 ≤ i ≤ n. Here f′ denotes a subgradient of f in the Fréchet subdifferential (see Definition A.1). When n = 1 and α_1 = 1, it is well known that f is a smooth function. When n = 1 and 0 < α_1 < 1, f satisfying (1) is known as a weakly smooth function. When n = 1 and α_1 = 0, f satisfying (1) is a non-smooth function. Thus, the setting we consider covers all three cases: smooth, weakly smooth, and non-smooth. For ease of reference, we term a function satisfying condition (1) a semi-smooth function. The target distribution ν is assumed to satisfy LSI or PI, but f can be non-convex. The class of distributions we consider has been studied in (Chatterji et al., 2020; Chewi et al., 2021; Nguyen et al., 2021) for MCMC sampling. One example of such a distribution is ν ∝ exp(−‖A_2σ(A_1x) − b‖² − ‖x‖² − ‖x‖) for some full rank matrices A_1, A_2, a vector b, and some activation (e.g., ReLU) function σ; this is a Bayesian regression problem. In this work, we aim to develop a proximal algorithm for sampling from non-convex potentials that satisfy (1). Our method is built on the alternating sampling framework (ASF) introduced in Lee et al. (2021).
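To make condition (1) concrete, here is a minimal numerical check (our own illustration, not from the paper) for the one-dimensional potential f(x) = |x| + x²/2, which satisfies (1) with n = 2, (α_1, L_{α_1}) = (0, 2) and (α_2, L_{α_2}) = (1, 1):

```python
import numpy as np

def subgrad(x):
    # A subgradient of f(x) = |x| + x**2/2; sign(0) = 0 is a valid choice at the kink.
    return np.sign(x) + x

rng = np.random.default_rng(0)
u, v = rng.normal(size=1000), rng.normal(size=1000)

lhs = np.abs(subgrad(u) - subgrad(v))
# Right-hand side of (1): L_{a1}|u-v|^{a1} + L_{a2}|u-v|^{a2} with (a1, L_{a1}) = (0, 2), (a2, L_{a2}) = (1, 1).
rhs = 2.0 * np.abs(u - v) ** 0 + 1.0 * np.abs(u - v) ** 1
print(bool(np.all(lhs <= rhs + 1e-12)))  # expected: True
```

The non-smooth part contributes the α = 0 term and the quadratic part the α = 1 term, which is exactly the composite structure treated in Section 4.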
Unlike most existing sampling algorithms that require the potential to be smooth, ASF is applicable to semi-smooth problems. Initialized at x0 ρX 0 , ASF with target distribution πX(x) exp( f(x)) and stepsize η > 0 performs the two alternating steps as follows. Algorithm 1 Alternating Sampling Framework (Lee et al., 2021) 1. Sample yk πY |X(y | xk) exp[ 1 2η xk y 2] 2. Sample xk+1 πX|Y (x | yk) exp[ f(x) 1 2η x yk 2] The ASF is a special instance of the Gibbs sampling (Geman & Geman, 1984) from π(x, y) exp f(x) 1 2η x y 2 . (2) In Algorithm 1, sampling yk given xk in step 1 is easy since πY |X(y | xk) = N(xk, ηI) is an isotropic Gaussian distribution. Sampling xk+1 given yk in step 2 is however a highly nontrivial task; it corresponds to the restricted Gaussian oracle for f (Lee et al., 2021), defined as follows. Definition 2.1 Given a point y Rd and stepsize η > 0, the restricted Gaussian oracle (RGO) for f : Rd R is a sampling oracle that returns a random sample from a distribution proportional to exp( f( ) y 2/(2η)). The RGO can be viewed as the sampling counterpart of the proximal map in optimization that is widely used in proximal algorithms for optimization (Parikh & Boyd, 2014). The ASF is an idealized algorithm; Published in Transactions on Machine Learning Research (06/2023) an efficient implementation of the RGO is crucial to use this framework in practice. For some special cases of f, the RGO admits a computationally efficient realization (Mou et al., 2022a; Shen et al., 2020; Liang & Chen, 2022). For general f, especially semi-smooth ones considered in this work, it was not clear how to realize the RGO efficiently. Under the assumption that the RGO in the ASF can be efficiently realized, the ASF exhibits surprising convergence properties. It was firstly established in Lee et al. (2021) that, when f is strongly convex, Algorithm 1 converges linearly. This result is further improved recently in Chen et al. (2022) under various weaker assumptions on the target distribution πX exp( f). We summarize below several convergence results established in Chen et al. (2022) that will be used. Throughout the paper, we abuse notation by identifying a probability measure with its density w.r.t. Lebesgue measure. To this end, for two probability distributions ρ ν, we denote by Hν(ρ) := Z ρ log ρ ν , χ2 ν(ρ) := Z ρ2 ν 1, Rq,ν(ρ) := 1 q 1 log Z ρq the KL divergence, the Chi-squared divergence, and the Rényi divergence, respectively. Note that R2,ν = log(1+χ2 ν), R1,ν = Hν, and Rq,ν Rq ,ν for any 1 q q < (Rényi, 1961; Vempala & Wibisono, 2019). We denote by W2 the Wasserstein-2 distance (Villani, 2021). Recall a probability distribution ν satisfies LSI with constant CLSI > 0 (1/CLSI-LSI) if for every ρ, Hν(ρ) CLSI 2 Jν(ρ), where the Fisher information Jν(ρ) is defined as Jν(ρ) = Eρ[ log ρ ν 2]. A probability distribution ν satisfies PI with constant CPI > 0 (1/CPI-PI) if for any smooth bounded function ψ : Rd R, we have Eν[(ψ Eν(ψ))2] CPIEν[ ψ 2]. Theorem 2.2 ((Chen et al., 2022, Theorem 3)) Assume that πX exp( f) satisfies λ-LSI. For any initial distribution ρX 0 , the k-th iterate ρX k of Algorithm 1 with step size η > 0 satisfies HπX(ρX k ) HπX(ρX 0 ) (1 + λη)2k . Moreover, for all q 1, Rq,πX(ρX k ) Rq,πX(ρX 0 ) (1 + λη)2k/q . Theorem 2.3 ((Chen et al., 2022, Theorem 4)) Assume πX exp( f) satisfies λ-PI. For any initial distribution ρX 0 , the k-th iterate ρX k of Algorithm 1 with step size η > 0 satisfies χ2 πX(ρX k ) χ2 πX(ρX 0 ) (1 + λη)2k . 
Moreover, for all q 2, Rq,πX(ρX k ) ( Rq,πX(ρX 0 ) 2k log(1+λη) q , if k q 2 log(1+λη) (Rq,πX(ρX 0 ) 1) , 1/(1 + λη)2(k k0)/q , if k k0 := q 2 log(1+λη) (Rq,πX(ρX 0 ) 1) . Theorem 2.4 ((Chen et al., 2022, Theorem 2)) Assume that πX exp( f) is log-concave. For any initial distribution ρX 0 , the k-th iterate ρX k of Algorithm 1 with step size η > 0 satisfies HπX(ρX k ) W 2 2 (ρX 0 , πX) kη . To use the ASF for sampling problems, we need to realize the RGO with efficient implementations. In the rest of this paper, we develop an efficient algorithm for RGO associated with a potential satisfying (1), and then combine it with the ASF to establish a proximal algorithm for sampling. The complexity of the proximal algorithm can be obtained by combining the above convergence results for ASF and the complexity results we establish for RGO. The rest of the paper is organized as follows. In Section 3 we consider a special Published in Transactions on Machine Learning Research (06/2023) case of (1) with n = 1, develop an efficient implementation for RGO via rejection sampling, and establish complexity results for sampling from distributions with non-convex and semi-smooth potentials. In Section 4, we extend the aforementioned complexity results to the general cases (1). In Section 5, we specialize (1) to the case where f is convex. In Section 6, we present preliminary computational results to demonstrate the efficacy of the proximal sampling algorithm. In Section 7, we present some concluding remarks. Finally, in Appendices A-E, we present technical results and proofs omitted in the paper and provide a self-contained discussion on solving a subproblem in the proximal sampling algorithm. 3 Proximal sampling for non-convex and semi-smooth potentials Our main objective in this section is to establish complexity results for sampling from distributions with non-convex and semi-smooth potentials satisfying (1) with n = 1, i.e., f (u) f (v) Lα u v α, u, v Rd. (3) We refer to such a function an Lα-α-semi-smooth function. The bottleneck of Algorithm 1 for sampling from a general distribution exp( f) is an efficient realization of the RGO, i.e., step 2 of Algorithm 1. To address this issue, we develop an efficient algorithm for the corresponding RGO based on rejection sampling. We show that, with a carefully designed proposal and a sufficiently small η, the expected number of rejection sampling steps to obtain one effective sample in RGO turns out to be bounded above by a dimension-free constant. The core to achieving such a constant bound on the expected rejection steps is a novel construction of proposal distribution that does not rely on the convexity of f and a refined analysis that captures the nature of semi-smooth functions. We utilize a useful property of semi-smooth functions that they can be approximated by smooth functions to arbitrary accuracy, at the cost of increasing their smoothness parameters. This is formalized in the following lemma, of which the proof is postponed to Appendix B. Relevant ideas have been explored in Devolder et al. (2014); Nesterov (2015) to design universal methods for convex semi-smooth optimization problems. Lemma 3.1 Assume f is an Lα-α-semi-smooth function, then for δ > 0 and every u, v Rd |f(u) f(v) f (v), u v | M 2 u v 2 + (1 α)δ 1 α α+1 . (5) Our algorithm is inspired by Liang & Chen (2022), which also uses rejection sampling for RGO. 
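Before describing the proposal construction, here is a quick sanity check of Lemma 3.1 (ours, not from the paper; it assumes the reading M = L_α^{2/(α+1)} / [(α+1)δ]^{(1−α)/(α+1)} of the smoothing constant in (5), consistent with the proof in Appendix B.1). The sketch verifies the bound numerically for f(x) = |x|, for which α = 0 and L_0 = 2:

```python
import numpy as np

L0, alpha, delta = 2.0, 0.0, 0.3
# Assumed form of the smoothing constant in (5):
# M = L_alpha^{2/(alpha+1)} / ((alpha+1)*delta)^{(1-alpha)/(alpha+1)}.
M = L0 ** (2 / (alpha + 1)) / ((alpha + 1) * delta) ** ((1 - alpha) / (alpha + 1))

f, fprime = np.abs, np.sign  # f(x) = |x| and a valid subgradient

rng = np.random.default_rng(1)
u, v = rng.normal(size=10_000), rng.normal(size=10_000)

lhs = np.abs(f(u) - f(v) - fprime(v) * (u - v))
rhs = 0.5 * M * (u - v) ** 2 + 0.5 * (1 - alpha) * delta
print(bool(np.all(lhs <= rhs + 1e-12)))  # expected: True
```

Larger δ makes M smaller (a cruder but flatter quadratic majorant), which is the trade-off exploited when choosing the stepsize η below.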
The proposal of rejection sampling used in Liang & Chen (2022) is a Gaussian distribution centered at an (approximate) minimizer of the regularized potential function. We thus consider the regularized optimization problem proxηf(y) := argmin x Rd f η y (x) := f(x) + 1 2η x y 2 , (6) where y Rd is given. Departing from the convex setting studied in Liang & Chen (2022), when f is nonconvex and semi-smooth, (6) may not be a convex optimization regardless of the value of η. Nevertheless, thanks to Lemma 3.1, f η y is close to a strongly convex and smooth function up to some approximation error, and (6) can still be solved efficiently using convex smooth optimization algorithms such as Nesterov s acceleration. We describe a variant of the method in Algorithm 3 in Appendix D. The following proposition presents a complexity result of Algorithm 3 for finding an approximate stationary point of f η y with a small η. Its proof is postponed to Appendix B. Proposition 3.2 Assume η 1 Md, and let w Rd be an approximate stationary point of f η y , i.e., Md, s = f (w) + 1 η (w y), (7) where M is as in (5). Then, the iteration-complexity to find w with Algorithm 3 is O(1). Published in Transactions on Machine Learning Research (06/2023) With this approximate stationary point, we obtain the following key ingredients of our rejection samplingbased RGO. Lemma 3.3 Let w Rd be a stationary point of f η y , i.e., η (w y) = 0, (8) and w be an approximate stationary point as in (7). Define hw 1,y(x) := f(w) + f (w), x w M 2 x w 2 + 1 2η x y 2 (1 α)δ hw 2,y(x) := f(w ) + f (w ), x w + M 2 x w 2 + 1 2η x y 2 + (1 α)δ Then, we have for every x Rd, hw 1,y(x) f η y (x) hw 2,y(x). (11) Proof: Inequalities in (11) directly follow from (4) and the definitions of hw 1,y and hw 2,y in (9) and (10), respectively. We are now ready to present the rejection sampling algorithm (Algorithm 2) for RGO. Algorithm 2 RGO Rejection Sampling 1. Compute an approximate solution w satisfying (7) with Algorithm 3 2. Generate sample X exp( hw 1,y(x)) 3. Generate sample U U[0, 1] 4. If U exp( f η y (X)) exp( hw 1,y(X)), then accept/return X; otherwise, reject X and go to step 2. By construction, the proposal exp( hw 1,y(x)) is close to the target πX|Y (x | y) exp( f η y (x)) when η is sufficiently small, and hence the expected number of rejection steps is small. The following proposition rigorously justifies this intuition. Proposition 3.4 Assume f is Lα-α-semi-smooth and let f η y be as in (6). Then X generated by Algorithm 2 follows the distribution πX|Y (x | y) exp f η y (x) . Moreover, if η 1 Md = [(α + 1)δ] 2 α+1 α d , (12) then the expected number of rejection steps in Algorithm 2 is at most exp 3(1 α)δ Proof: It is well-known in rejection sampling X πX|Y (x|y). By definition, the probability that X is accepted is U exp( f η y (X)) exp( hw 1,y(X)) = Z exp( f η y (x)) exp( hw 1,y(x)) exp( hw 1,y(x)) R exp( hw 1,y(z))dz dx = R exp( f η y (x))dx R exp( hw 1,y(x))dx. Published in Transactions on Machine Learning Research (06/2023) Using the above identity, (11), and Lemma A.4, we have U exp( f η y (X)) exp( hw 1,y(X)) R exp( hw 2,y(x))dx R exp( hw 1,y(x))dx d/2 exp 1 2η w 2 f(w ) + f (w ), w 1 2η y 2 1 α exp 1 2η w 2 + η 2(1 ηM) s 2 f(w) + 1 η w, y w 1 2η y 2 + 1 α d/2 exp (1 α)δ η 2(1 ηM) s 2 2η w 2 f(w ) + f(w) + f (w ), w 1 η w, y w . 
(13) It follows from (4) with (u, v) = (w, w ) that f(w) + f(w ) + f (w ), w w M 2 w w 2 + (1 α)δ which together with (8) implies that 2η w 2 f(w ) + f(w) + f (w ), w 1 2η w 2 f(w ) + f(w) + f (w ), w 1 η w, ηf (w ) + w w 2η w w 2 f(w ) + f(w) + f (w ), w w 2 w w 2 (1 α)δ where the last inequality is due to (12). Plugging the above inequality into (13), we obtain U exp( f η y (X)) exp( hw 1,y(X)) d/2 exp 3(1 α)δ 2 η 2(1 ηM) s 2 . Hence, using the above bound, (12) and (7), we arrive at the following bound on the expected number of rejection iterations P U exp( f η y (X)) exp( hw 1,y(X)) 1 + ηM d/2 exp 3(1 α)δ 2 + η 2(1 ηM) s 2 1 + 2ηM 1 ηM d/2 exp 3(1 α)δ 2 + η s 2 (1 + 4ηM)d/2 exp 3(1 α)δ d/2 exp 3(1 α)δ 2 + 1 exp 3(1 α)δ Note that δ is a tunable parameter. Choosing a large δ makes M small and η large in view of (5) and (12), respectively. Such a choice results in a better complexity of ASF but a larger number of expected rejection steps in RGO. Combining Proposition (3.4) and the convergence results of ASF we obtain the following nonasymptotic complexity bound to sample from non-convex semi-smooth potentials. This complexity bound is better than all existing results when α [0, 1), see Table 1 for comparison. Theorem 3.5 Assume f is Lα-α-semi-smooth and πX exp( f) satisfies PI with constant CPI. With initial distribution ρX 0 and stepsize η 1/(L 2 α+1 α d), Algorithm 1 using Algorithm 2 as an RGO has the Published in Transactions on Machine Learning Research (06/2023) iteration-complexity bound O CPIL 2 α+1 α d log χ2 πX(ρX 0 ) (14) to achieve ε error to the target πX in terms of Chi-squared divergence, and 2 α+1 α qd Rq,πX(ρX 0 ) , q 2 (15) to achieve ε error in Rényi divergence Rq,πX. Each iteration queries O(1) subgradients of f. Proof: The result is a direct consequence of Theorem 2.3, Proposition 3.2 and Proposition 3.4 with the choice of stepsize η 1/(L 2 α+1 α d). Remark: When the coordinate is scaled by γ as x γx, by definition, the semi-smoothness coefficient becomes γα+1Lα and the PI constant becomes CPI/γ2. By Proposition 3.4, to attain O(1) complexity for the RGO, the stepsize should be η [(α+1)δ] 1 α α+1 (γα+1Lα) 2 α+1 d = [(α+1)δ] 1 α α+1 2 α+1 α d . Thus, applying Theorem 2.3, the total complexity remains the same as that without coordinate scaling. 4 Proximal sampling for non-convex composite potentials This section is devoted to the sampling from distributions with non-convex composite potential f satisfying (1). Results presented in this section generalize those in Section 3, which are for the setting with n = 1. Although Section 3 is developed for the simple case where n = 1 in (1), the proof techniques apply to the general case of (1). Hence, to avoid duplication, we present the following results analogous to those in Section 3 without giving proofs. The following lemma is a direct generalization of Lemma 3.1, which shows that functions satisfying (1) can be approximated by smooth functions up to some controllable approximation errors. Lemma 4.1 Assume f satisfies (1), then for any δ > 0, we have |f(u) f(v) f (v), u v | Mn 2 , u, v Rd, (16) [(αi + 1)δ] 1 αi αi+1 . (17) The next proposition is a counterpart of Proposition 3.2 and gives the complexity of solving optimization problem (6) in the context of (1). Proposition 4.2 Assume η 1/(Mnd), and let wn Rd be an approximate stationary point of f η y such that f (wn) + 1 η(wn y) Mnd. Then, the iteration-complexity to find wn with Algorithm 3 is O(1). 
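As an aside, the following minimal sketch (ours; it assumes the reading M_n = Σ_{i=1}^n L_{α_i}^{2/(α_i+1)} / [(α_i+1)δ]^{(1−α_i)/(α_i+1)} of (17)) shows how the composite smoothing constant M_n and the admissible stepsize η ≤ 1/(M_n d) of Proposition 4.2 could be computed for a hypothetical composite potential:

```python
import numpy as np

def smoothing_constant(L, alpha, delta):
    """Composite smoothing constant M_n, under our reading of (17):
    M_n = sum_i L_i^{2/(a_i+1)} / ((a_i+1)*delta)^{(1-a_i)/(a_i+1)}."""
    L, alpha = np.asarray(L, float), np.asarray(alpha, float)
    return float(np.sum(L ** (2 / (alpha + 1))
                        / ((alpha + 1) * delta) ** ((1 - alpha) / (alpha + 1))))

# Hypothetical composite potential: a non-smooth (alpha=0), a weakly smooth
# (alpha=1/2), and a smooth (alpha=1) component; delta is the tunable accuracy.
L, alpha, delta, d = [2.0, 1.0, 5.0], [0.0, 0.5, 1.0], 1.0, 100

Mn = smoothing_constant(L, alpha, delta)
eta = 1.0 / (Mn * d)  # stepsize range used in Propositions 4.2-4.3
print(Mn, eta)
```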
Again, the core to the proximal algorithm (Algorithm 1) is an efficient implementation of RGO. We use Algorithm 2 with slight modification as a realization of RGO in the context of (1). First, in step 1 of Algorithm 2, we use Algorithm 3 to compute wn as in Proposition 4.2 instead of w. Second, in steps 2 and 4 of Algorithm 2, instead of using hw 1,y as in (9), we define hw 1,y as follows, hw 1,y(x) = f(w) + f (w), x w Mn 2 x w 2 + 1 where Mn is as in (17). The next proposition is a generalization of Proposition 3.4. Published in Transactions on Machine Learning Research (06/2023) Proposition 4.3 Assume f satisfies (1) and let f η y be as in (6). Then X generated by Algorithm 2 with modification follows the distribution πX|Y (x | y) exp f η y (x) . Moreover, if η 1 Mnd, then the expected number of rejection steps in Algorithm 2 is at most exp 3Pn In the rest of this section, we present two results on the complexity of sampling for non-convex composite potentials where the target distribution satisfies either LSI or PI. Our complexity bounds are better than all existing results when min αi [0, 1), see Table 1 for comparison. Theorem 4.4 Assume f satisfies (1) and πX exp( f) satisfies LSI with constant CLSI. With initial distribution ρX 0 and stepsize η 1/ Pn i=1 L 2 αi+1 αi d , Algorithm 1 using the modified Algorithm 2 as an RGO has the iteration-complexity bound 2 αi+1 αi d to achieve ε error to πX in KL divergence. Each iteration queries O(1) subgradients of f and generates O(1) samples in expectation from Gaussian distribution.. Moreover, for all q 1, the iteration-complexity bound to achieve ε error to πX in Rényi divergence Rq,πX is 2 αi+1 αi dq Proof: This theorem is a direct consequence of Theorem 2.2 and Propositions 4.2-4.3. Theorem 4.5 Assume f satisfies (1) and πX exp( f) satisfies PI with constant CPI. With initial distribution ρX 0 and stepsize η 1/ Pn i=1 L 2 αi+1 αi d , Algorithm 1 using the modified Algorithm 2 as an RGO has the iteration-complexity bound 2 αi+1 αi d log χ2 πX(ρX 0 ) to achieve ε error to the target πX in terms of Chi-squared divergence, and 2 αi+1 αi qd Rq,πX(ρX 0 ) to achieve ε error in Rényi divergence Rq,πX. Each iteration queries O(1) subgradients of f and generates O(1) samples in expectation from Gaussian distribution. Proof: This theorem immediately follows from Theorem 2.3 and Propositions 4.2-4.3. 5 Proximal sampling for convex composite potentials For completeness, in this section we present a complexity result for sampling from distributions with convex potential f satisfying (1). We only consider the weakly convex setting here as strongly log-concave distributions satisfy LSI and are thus covered by Theorem 4.4. The result is a consequence of Theorem 2.4 for log-concave densities combined with our RGO implementation. In particular, Lemma 4.1 and Propositions 4.2-4.3 apply to this setting. Hence, using Theorem 2.4 and Propositions 4.2-4.3, we directly obtain the following complexity result. Note our result is applicable to composite potentials with any number of components while most existing results are only applicable to composite potentials with at most two components. We also present an alternative result via a regularization approach in Appendix C. Published in Transactions on Machine Learning Research (06/2023) Theorem 5.1 Assume f satisfies (1) and πX exp( f) is log-concave. 
With initial distribution ρX 0 and stepsize η 1/ Pn i=1 L 2 αi+1 αi d , Algorithm 1 using Algorithm 2 as an RGO has the iteration-complexity W 2 2 (ρX 0 , πX) Pn i=1 L 2 αi+1 αi d ε to achieve ε error to the target πX in terms of KL divergence. Each iteration queries O(1) subgradients of f and generates O(1) samples in expectation from standard Gaussian distribution. Proof: This theorem is a direct consequence of Theorem 2.4 and Propositions 4.2-4.3. 6 Computational results In this section, we present a numerical example to illustrate our result. We consider sampling from a Gaussian-Laplace mixture ν(x) = 0.5(2π) d/2p det Q exp( (x 1) Q(x 1)/2) + 0.5(2d) exp( 4x 1) where Q = USU , d = 5, S = diag(14, 15, 16, 17, 18), and U is an arbitrary orthogonal matrix. We run 500, 000 iterations (with 100, 000 burn-in iterations) for both the proximal sampling algorithm and LMC with η = 1/(Md) where d = 5 and M is as in (5) with (α, Lα) = (1, 27) and δ = 1. Histograms and trace plots (of the 3-rd coordinate) of the samples generated by both methods are presented in Figures 1 and 2. The average numbers of optimization iterations and rejection sampling iterations in Algorithm 2 are 1.5 and 1.3, respectively. In addition, we also run 2, 500, 000 iterations (with 500, 000 burn-in iterations) for LMC with η = 1/(5Md). Figure 3 presents the histogram and trace plot for this LMC. The red curves in histograms are the scaled target density ν(x). Figures 4 and 5 show the results for MALA in the same settings as those of Figures 2 and 3, respectively. In these experiments, we observe that our algorithm achieves better accuracy under the same computational budget constraints. Figure 1: Gaussian-Lasso mixture using the proximal sampling algorithm Published in Transactions on Machine Learning Research (06/2023) Figure 2: Gaussian-Lasso mixture using LMC Figure 3: Gaussian-Lasso mixture using LMC with small η Figure 4: Gaussian-Lasso mixture using MALA Figure 5: Gaussian-Lasso mixture using MALA with small η Published in Transactions on Machine Learning Research (06/2023) 7 Conclusions In this paper, we develop a novel proximal sampling algorithm for distributions with semi-smooth potentials and establish complexity-bound results for the proposed method under the assumption that distributions are log-concave or satisfy LSI or PI. Our proximal algorithm is based on the ASF, which resembles the proximal point method in optimization. Each iteration of the ASF generates a sample from a regularized target distribution by querying the RGO, which is itself a challenging algorithmic task due to the lack of smoothness and possibly convexity. The core to our approach is an efficient realization of the RGO based on rejection sampling. We develop a novel technique to bound the expected number of rejection steps in the RGO, which leads to competitive complexity results to sample from semi-smooth and possibly non-convex potentials. Krishna Balasubramanian, Sinho Chewi, Murat A Erdogdu, Adil Salim, and Shunshi Zhang. Towards a theory of non-log-concave sampling: first-order stationarity guarantees for langevin monte carlo. In Conference on Learning Theory, pp. 2896 2923. PMLR, 2022. Espen Bernton. Langevin Monte Carlo and JKO splitting. In Conference On Learning Theory, pp. 1777 1798. PMLR, 2018. Nawaf Bou-Rabee and Martin Hairer. Nonasymptotic mixing of the MALA algorithm. IMA Journal of Numerical Analysis, 33(1):80 110, 2013. Niladri Chatterji, Jelena Diakonikolas, Michael I Jordan, and Peter Bartlett. 
Langevin Monte Carlo without smoothness. In International Conference on Artificial Intelligence and Statistics, pp. 1716 1726. PMLR, 2020. Yongxin Chen, Sinho Chewi, Adil Salim, and Andre Wibisono. Improved analysis for a proximal algorithm for sampling. In Conference on Learning Theory, pp. 2984 3014. PMLR, 2022. Sinho Chewi, Murat A Erdogdu, Mufan Bill Li, Ruoqi Shen, and Matthew Zhang. Analysis of Langevin Monte Carlo from Poincaré to Log-Sobolev. ar Xiv preprint ar Xiv:2112.12662, 2021. James B Clarage, Tod Romo, B Kim Andrews, B Montgomery Pettitt, and George N Phillips. A sampling problem in molecular dynamics simulations of macromolecules. Proceedings of the National Academy of Sciences, 92(8):3288 3292, 1995. Arnak S Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651 676, 2017. Arnak S Dalalyan and Lionel Riou-Durand. On sampling from a log-concave density using kinetic Langevin diffusions. Bernoulli, 26(3):1956 1988, 2020. Arnak S Dalalyan, Avetik Karagulyan, and Lionel Riou-Durand. Bounding the error of discretized langevin algorithms for non-strongly log-concave targets. Journal of Machine Learning Research, 23(235):1 38, 2022. Olivier Devolder, François Glineur, and Yurii Nesterov. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1):37 75, 2014. Alain Durmus, Eric Moulines, and Marcelo Pereyra. Efficient Bayesian computation by proximal Markov Chain Monte Carlo: when Langevin meets Moreau. SIAM Journal on Imaging Sciences, 11(1):473 506, 2018. Alain Durmus, Szymon Majewski, and Błażej Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. The Journal of Machine Learning Research, 20(1):2666 2711, 2019. Published in Transactions on Machine Learning Research (06/2023) Murat A Erdogdu and Rasa Hosseinzadeh. On the convergence of Langevin Monte Carlo: the interplay between tail growth and smoothness. In Conference on Learning Theory, pp. 1776 1822. PMLR, 2021. Yoav Freund, Yi-An Ma, and Tong Zhang. When is the convergence time of langevin algorithms dimension independent? a composite optimization viewpoint. Journal of Machine Learning Research, 23(214):1 32, 2022. Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B Rubin. Bayesian Data Analysis. CRC press, 2013. Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on pattern analysis and machine intelligence, (6):721 741, 1984. Ulf Grenander and Michael I Miller. Representations of knowledge in complex systems. Journal of the Royal Statistical Society: Series B (Methodological), 56(4):549 581, 1994. Pavel Izmailov, Sharad Vikram, Matthew D Hoffman, and Andrew Gordon Gordon Wilson. What are bayesian neural network posteriors really like? In International Conference on Machine Learning, pp. 4629 4640. PMLR, 2021. Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009. Igor Kononenko. Bayesian neural networks. Biological Cybernetics, 61(5):361 370, 1989. Werner Krauth. Statistical mechanics: algorithms and computations, volume 13. OUP Oxford, 2006. Yin Tat Lee, Ruoqi Shen, and Kevin Tian. Structured logconcave sampling with a restricted Gaussian oracle. In Conference on Learning Theory, pp. 2993 3050. PMLR, 2021. Joseph Lehec. 
The langevin monte carlo algorithm in the non-smooth log-concave case. Available on ar Xiv:2101.10695, 2021. Jiaming Liang and Yongxin Chen. A proximal algorithm for sampling from non-smooth potentials. In 2022 Winter Simulation Conference (WSC), pp. 3229 3240, 2022. Tung Duy Luu, Jalal Fadili, and Christophe Chesneau. Sampling from non-smooth distributions through Langevin diffusion. Methodology and Computing in Applied Probability, 23(4):1173 1201, 2021. Tatiana Maximova, Ryan Moffatt, Buyong Ma, Ruth Nussinov, and Amarda Shehu. Principles and overview of sampling methods for modeling macromolecular structure and dynamics. PLo S computational biology, 12(4):e1004619, 2016. Wenlong Mou, Nicolas Flammarion, Martin J Wainwright, and Peter L Bartlett. An efficient sampling algorithm for non-smooth composite potentials. Journal of Machine Learning Research, 23(233):1 50, 2022a. Wenlong Mou, Nicolas Flammarion, Martin J Wainwright, and Peter L Bartlett. Improved bounds for discretization of langevin diffusions: Near-optimal rates without convexity. Bernoulli, 28(3):1577 1601, 2022b. Radford M Neal. MCMC using hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011. Yu Nesterov. Universal gradient methods for convex optimization problems. Mathematical Programming, 152(1):381 404, 2015. Dao Nguyen, Xin Dang, and Yixin Chen. Unadjusted Langevin algorithm for non-convex weakly smooth potentials. ar Xiv preprint ar Xiv:2101.06369, 2021. Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in optimization, 1(3):127 239, 2014. Published in Transactions on Machine Learning Research (06/2023) Giorgio Parisi. Correlation functions and computer simulations. Nuclear Physics B, 180(3):378 384, 1981. Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory, pp. 1674 1703. PMLR, 2017. Alfréd Rényi. On measures of entropy and information. In Proceedings of the fourth Berkeley symposium on mathematical statistics and probability, volume 1. Berkeley, California, USA, 1961. Gareth O Roberts and Osnat Stramer. Langevin diffusions and Metropolis-Hastings algorithms. Methodology and Computing in Applied Probability, 4(4):337 357, 2002. Gareth O Roberts and Richard L Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pp. 341 363, 1996. Adil Salim and Peter Richtárik. Primal dual interpretation of the proximal stochastic gradient Langevin algorithm. Advances in Neural Information Processing Systems, 33:3786 3796, 2020. Ruoqi Shen, Kevin Tian, and Yin Tat Lee. Composite logconcave sampling with a Restricted Gaussian Oracle. Available on ar Xiv:2006.05976, 2020. Jack W Sites Jr and Jonathon C Marshall. Delimiting species: a renaissance issue in systematic biology. Trends in Ecology & Evolution, 18(9):462 470, 2003. Santosh Vempala and Andre Wibisono. Rapid convergence of the unadjusted langevin algorithm: Isoperimetry suffices. Advances in neural information processing systems, 32, 2019. Cédric Villani. Topics in optimal transportation, volume 58. American Mathematical Soc., 2021. Andre Wibisono. Sampling as optimization in the space of measures: The langevin dynamics as a composite optimization problem. In Conference on Learning Theory, pp. 2093 3027. PMLR, 2018. Andre Wibisono. Proximal langevin algorithm: Rapid convergence under isoperimetry. ar Xiv preprint ar Xiv:1911.01469, 2019. 
Published in Transactions on Machine Learning Research (06/2023) 1 Introduction 1 2 Problem formulation and Background 3 3 Proximal sampling for non-convex and semi-smooth potentials 5 4 Proximal sampling for non-convex composite potentials 8 5 Proximal sampling for convex composite potentials 9 6 Computational results 10 7 Conclusions 12 A Technical results 16 B Missing proofs 18 B.1 Proof of Lemma 3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 B.2 Proof of Proposition 3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 C Sampling from regularized convex semi-smooth potentials 19 D Solving the optimization problem 21 E Approximate implementations of ASF 24 Published in Transactions on Machine Learning Research (06/2023) A Technical results This section collects the definition of Frechet subdifferential and a few technical results that are useful for the analysis of RGO in Section 3. Definition A.1 (Frechet subdifferential) Let f : Rn R { } be a proper closed function, then the Frechet subdifferential is defined as f(x) = v Rn : lim inf y x f(y) f(x) v, y x The first result is the well-known Gaussian integral. Lemma A.2 A useful Gaussian integral: for any η > 0, Z 2η x 2 dx = (2πη)d/2. The following lemma shows that f η y as in (6) is close to a strongly convex and smooth function when η is small. This result is used in Proposition 3.2. Lemma A.3 Let f η y := f + 1 2η y 2 and (f η y ) := f + 1 η( y), then we have for every u, v Rd, η M u v 2 (1 α)δ 2 f η y (u) f η y (v) (f η y ) (v), u v η + M u v 2 + (1 α)δ Proof: Using the definitions of f η y and (f η y ) , we have f η y (u) f η y (v) (f η y ) (v), u v =f(u) f(v) f (v), u v + 1 =f(u) f(v) f (v), u v + 1 The lemma now follows from above identity and (4). The next lemma gives equivalent formulas of R exp( hw 1,y(x))dx and R exp( hw 2,y(x))dx that are useful in Proposition 3.4. Lemma A.4 Recall w and w are defined in Proposition 3.2 and Lemma 3.3, respectively. Let hw 1,y and hw 2,y be as in (9) and (10), respectively. Then, we have the following integrals Z exp( hw 1,y(x))dx = 2πη 1 ηM d/2 exp (H1(w)) , (21) Z exp( hw 2,y(x))dx = 2πη 1 + ηM d/2 exp (H2(w )) , (22) 2η w 2 + η 2(1 ηM) s 2 f(w) + 1 2η y 2 + 1 α 2η w 2 f(w ) + f (w ), w 1 Published in Transactions on Machine Learning Research (06/2023) Proof: We first rewrite hw 1,y and hw 2,y as follows hw 1,y(x) =f(w) + f (w), x w M 2 x w 2 + 1 2η x y 2 (1 α)δ x + ηM 1 ηM w 1 1 ηM y + η 1 ηM f (w) 1 2η(1 ηM) ηMw y + ηf (w) 2 + f(w) f (w), w M hw 2,y(x) =f(w ) + f (w ), x w + M 2 x w 2 + 1 2η x y 2 + (1 α)δ x ηM 1 + ηM w 1 1 + ηM y + η 1 + ηM f (w ) 1 2η(1 + ηM) ηMw + y ηf (w ) 2 + f(w ) f (w ), w + M 2η y 2 + 1 α It follows from (25) and Lemma A.2 that Z exp( hw 1,y(x))dx = 2πη 1 ηM d/2 exp( ˆH1(w)), (27) Z exp( hw 2,y(x))dx = 2πη 1 + ηM d/2 exp( ˆH2(w )), (28) ˆH1(w) = 1 2η(1 ηM) ηMw y + ηf (w) 2 f(w) + f (w), w 2η y 2 + 1 α ˆH2(w ) = 1 2η(1 + ηM) ηMw + y ηf (w ) 2 f(w ) + f (w ), w It suffices to show that ˆH1(w) = H1(w) and ˆH2(w ) = H2(w ) to complete the proof. Published in Transactions on Machine Learning Research (06/2023) We first verify that ˆH1(w) = H1(w). 
Using the definition of ˆH1(w) above and the definition of s in (7), we have ˆH1(w) = 1 2η(1 ηM) ηMw w + w y + ηf (w) 2 f(w) + f (w), w 2η y 2 + 1 α = 1 2η(1 ηM) (ηM 1)w + ηs 2 f(w) + f (w), w 2η y 2 + 1 α 2 w 2 w, s + η 2(1 ηM) s 2 f(w) + f (w), w 2η y 2 + 1 α 2η w 2 + η 2(1 ηM) s 2 f(w) + 1 2η y 2 + 1 α In view of (23), we verify that ˆH1(w) = H1(w) and hence (21) is proved. We next verify that ˆH2(w ) = H2(w ). Using the definition of ˆH2(w ) and (8), we have ˆH2(w ) = 1 2η(1 + ηM) ηMw + w w + y ηf (w ) 2 f(w ) + f (w ), w = 1 2η(1 + ηM) (ηM + 1)w 2 M 2 w 2 f(w ) + f (w ), w 2η w 2 f(w ) + f (w ), w 1 In view of (24), we verify that ˆH2(w ) = H2(w ) and hence (22) is proved. B Missing proofs B.1 Proof of Lemma 3.1 It follows from the assumption that f is Lα-semi-smooth that for every u, v Rd, |f(u) f(v) f (v), u v | Lα α + 1 u v α+1. (29) Using the Young s inequality ab ap/p + bq/q with a = Lα (α + 1)δ 1 α 2 u v α+1, b = δ 1 α 2 , p = 2 α + 1, q = 2 1 α, we obtain Lα α + 1 u v α+1 L 2[(α + 1)δ] 1 α α+1 u v 2 + (1 α)δ Plugging the above inequality into (29), we have |f(u) f(v) f (v), u v | L 2[(α + 1)δ] 1 α α+1 u v 2 + (1 α)δ Published in Transactions on Machine Learning Research (06/2023) This inequality and the definition of M in (5) imply (4). B.2 Proof of Proposition 3.2 It follows from Lemma A.3 that f η y satisfies (38) with η + M, θ = (1 α)δ Since η 1/(Md), it is easy to verify that the assumption on ρ (i.e., Md in our case) in Proposition D.4 is satisfied. Hence, it follows from Proposition D.4 and (30) that the proposition holds. C Sampling from regularized convex semi-smooth potentials This section presents an alternative approach via regularization for sampling in the same setting as in Section 5, i.e., f is convex and satisfies (1). More specifically, the alternative complexity result is obtained by first applying Theorem 2.2 on a regularized convex semi-smooth potential and then specifying the regularization parameter. We consider the following regularized potential with some µ > 0, ˆf( ) = f( ) + µ 2 x0 2, (31) which is clearly µ-strongly convex by construction. Hence, exp( ˆf) satisfies µ-LSI and Theorem 2.2 is applicable. Since ˆf is µ-strongly convex, improved versions of results in Section 4 can be obtained. We omit the proofs since they can be easily obtained by following similar ideas in the proofs given in Section 3. Lemma C.1 Assume f is a convex function and satisfies (1), then for δ > 0 and every u, v Rd, we have ˆf(u) ˆf(v) ˆf (v), u v Mn + µ ˆf(u) ˆf(v) ˆf (v), u v µ where ˆf and Mn are as in (31) and (17), respectively. Proposition C.2 Assume η 1 Mnd, and let w Rd be an approximate stationary point of ˆf η y (x) := ˆf(x) + 1 2η x y 2 , (32) Mnd, s = ˆf (w) + 1 η (w y), (33) where Mn is as in (17). Then, the iteration-complexity to find w by using Algorithm 3 is O(1). Lemma C.3 Let w Rd be a stationary point of ˆf η y , i.e., ˆf (w ) + 1 η (w y) = 0. ˆhw 1,y(x) := ˆf(w) + ˆf (w), x w + µ 2 x w 2 + 1 ˆhw 2,y(x) := ˆf(w ) + ˆf (w ), x w + Mn + µ 2 x w 2 + 1 Published in Transactions on Machine Learning Research (06/2023) where w is as in (33). Then, we have for every x Rd, ˆhw 1,y(x) ˆf η y (x) ˆhw 2,y(x). Proposition C.4 Assume f is convex and satisfies (1) and let ˆf η y be as in (32). Then X generated by Algorithm 2 (with (7), f η y , and hw 1,y(x) replaced by (33), ˆf η y , and ˆhw 1,y(x), respectively) follows the distribution πX|Y (x | y) exp ˆf η y (x) . 
Moreover, if η 1 Mnd, then the expected number of rejection steps in Algorithm 2 is at most exp Pn i=1 (1 αi)δ Proposition C.5 Assume f is convex and satisfies (1), and let ˆf be as in (31). With initial distribution ρX 0 and stepsize η 1/ Pn i=1 L 2 αi+1 αi d , Algorithm 1 using Algorithm 2 as an RGO achieves ε error in terms of KL divergence with respect to exp( ˆf) in 2 αi+1 αi d µ iteration, and each iteration queries O(1) subgradient oracle of f and O(1) Gaussian distribution sampling oracle. Proof: By construction, ˆf is µ-strongly convex and exp( ˆf) satisfies µ-LSI. Using Theorem 2.2, starting from initial distribution ρX 0 , it takes 1 2µη log HπX(ρX 0 ) ε (35) iterations for Algorithm 1 to achieve ε error to exp( ˆf) with respect to KL divergence. By Propositions C.2 and C.4, Algorithm 2 queries O(1) subgradient oracle of f and O(1) Gaussian distribution sampling oracle. Complexity (34) then follows by plugging η into (35). Building upon Proposition C.5 for sampling from exp( ˆf(x)) = exp( f(x) µ x x0 2/2), we establish a complexity result, in the following theorem, to sample from the original target distribution πX exp( f) by choosing a proper regularization constant µ. Theorem C.6 Let πX exp( f(x)) be the target distribution where f is convex and satisfies (1). Let x0 Rd and ε > 0 be given and set 2 M4 + x0 xmin 2 (36) where M4 = R Rd x xmin 4dπX(x) and xmin Argmin {f(x) : x Rd}. With initial distribution ρX 0 and stepsize η 1/ Pn i=1 L 2 αi+1 αi d , Algorithm 1 using Algorithm 2 as an RGO for step 1, applied to ν exp( ˆf) = exp( f µ x0 2/2) has the iteration-complexity bound 2 αi+1 αi d M4 + x0 xmin 2 to achieve ε error to πX in terms of total variation. Proof: Let ρX denote the distribution of the samples generated by Algorithm 1 using Algorithm 2 as an RGO. Following the proof of Corollary 4.1 of Chatterji et al. (2020), we obtain ρX πX TV ρX ν TV + ν πX TV Published in Transactions on Machine Learning Research (06/2023) Rd[f(x) ˆf(x)]2dπX(x) 1/2 = 1 2 x x0 2 2 dπX(x) 1/2 x xmin 2 + xmin x0 2 2 dπX(x) 1/2 2 x xmin 4 + 2 xmin x0 4 dπX(x) 1/2 2µ 2 M4 + xmin x0 4 1/2 M4 + x0 xmin 2 = ε 2 where the last identity is due to the definition of µ in (36). Hence, it suffices to derive the iteration-complexity bound for Algorithm 1 to achieve ρX πX TV ε/2, which is (37) in view of Proposition C.5 with µ as in (36). Note that even though the complexity result in Proposition C.5 is with respect to KL divergence, one can get the same order of complexity with respect to total variation using Pinsker inequality. D Solving the optimization problem In this section, we use Nesterov s acceleration to establish the iteration-complexity for solving a general optimization problem that is nearly strongly convex and nearly smooth. This general result is then applied in Section 3 to find an approximate stationary point of f η y (see Proposition 3.2). We consider the optimization problem min{g(x) : x Rd}, where g satisfies 2 u v 2 θ g(u) g(v) g (v), u v L 2 u v 2 + θ, u, v Rd, (38) for some given θ > 0, µ 0, and L 0. We use Nesterov s accelerated gradient method to find a ρapproximate stationary point w such that g (w) ρ. We also establish the iteration-complexity to obtain a ρ-approximate stationary point. Algorithm 3 Accelerated Gradient Method 0. Let an initial point x0, parameters L, µ > 0 be given, and set y0 = x0, A0 = 0, τ0 = 1, and k = 0; 1. 
Compute ak = τk + p τ 2 k + 4τk LAk 2L , Ak+1 = Ak + ak, (39) τk+1 = τk + akµ, xk = Akyk + akxk Ak+1 ; (40) yk+1 := argmin u Rd 2 u xk 2 , (41) xk+1 := argmin u Rd n akγk(u) + τk 2 u xk 2o , (42) where γk(u) := g( xk) + g ( xk), u xk + µ 2 u xk 2; (43) 3. Set k k + 1 and go to step 1. Lemma D.1 For every k 0 and u Rd, we have θ g(u) γk(u) L µ 2 u xk 2 + θ. (44) Published in Transactions on Machine Learning Research (06/2023) Proof: This lemma directly follows from (38) and the definition of γk in (43). Lemma D.2 For every k 0, the following statements hold: a) Ak+1 = La2 k/τk; c) Pk i=0 Ai+1 exp(2(k+1)(β 1)/β) 1 L(β2 1) where β = 1 + µ 2 Proof: a) This statement directly follows from (39). b) It is easy to see from (39), A0 = 0 and τ0 = 1 that A1 = 1/L. Using the definitions of Ak and τk in (39) and (40), respectively, and the facts that A0 = 0 and τ0 = 1, we easily derive that τk = τ0 + Akµ = 1 + Akµ. (45) It follows from (39) that Ak+1 = Ak + ak = Ak + τk + p τ 2 k + 4τk LAk The above inequality and (45) imply Ak + 1 + Akµ This statement now follows from the above relation and the fact that A1 = 1/L. c) Noting from b) that Ak+1 β2k/L, which together with the fact x exp((x 1)/x) for x 1, implies that k X i=0 β2i = β2(k+1) 1 L(β2 1) exp(2(k + 1)(β 1)/β) 1 Lemma D.3 For every k 0, define tk(u) = Ak [g(yk) g(u)] + τk 2 u xk 2, (46) then for every u Rd, we have 2 Ak+1 yk+1 xk 2 tk(u) tk+1(u) + 2Ak+1θ. (47) Proof: Using the fact γk is convex and the definition of xk in (40), we have Akγk(yk) + akγk(u) + τk + τk A2 k+1 2a2 k Ak+1 min γk(u) + L γk(yk+1) + L 2 yk+1 xk 2 , Published in Transactions on Machine Learning Research (06/2023) where the first identity is due to (39) and the second identity is due to the definition of yk+1 in (41). It follows from the second inequality of (44) with u = yk+1 and the above inequality that Ak+1 h g(yk+1) θ + µ 2 yk+1 xk 2i γk(yk+1) + L 2 yk+1 xk 2 Akγk(yk) + akγk(xk+1) + τk 2 xk+1 xk 2 Akγk(yk) + akγk(u) + τk 2 u xk 2 τk+1 where the last inequality is due to (42) and the fact that akγk + τk xk 2/2 is τk+1-strongly convex. Rearranging the terms in the above inequality, we obtain µ 2 Ak+1 yk+1 xk 2 Akγk(yk) + akγk(u) + τk 2 u xk 2 τk+1 2 u xk+1 2 Ak+1 [g(yk+1) θ] =Ak [g(yk) g(u)] + τk 2 u xk 2 Ak+1 [g(yk+1) g(u)] τk+1 + Ak [γk(yk) g(yk)] + ak [γk(u) g(u)] + Ak+1θ tk(u) tk+1(u) + 2Ak+1θ where the identity is due to the fact that Ak+1 = Ak + ak, and the last inequality is due to (46) and the first inequality of (44). Proposition D.4 If ρ 2 θ/ µ, then the number of iterations k0 to obtain a ρ-approximate stationary point of g is at most L + µ 2 µ log (µ + L)2d2 0 ρ2 2 L + µ 2 µ + 1 Proof: It follows from the optimality condition of (41) that g ( xk) = (µ + L)( xk yk+1). Using the above relation and summing (47) with u = x from k = 0 to k 1, we have µ 2(µ + L)2 i=0 Ai+1 g ( xi) 2 = µ i=0 Ai+1 yi+1 xi 2 i=0 Ai+1θ = d2 0 2 + 2 where the last identity follows from the facts that A0 = 0 and τ0 = 1. The above inequality and the assumption on ρ imply that min 0 i k 1 g ( xi) 2 (µ + L)2 d2 0 Pk 1 i=0 Ai+1 + 4θ µ d2 0 Pk 1 i=0 Ai+1 + ρ2 In order to show min0 i k 1 g ( xi) ρ, it suffices to show µ d2 0 Pk 1 i=0 Ai+1 ρ2 Using Lemma D.2 (c) and the fact that k k0 where k0 is as in (48), we have i=0 Ai+1 exp(2k(β 1)/β) 1 L(β2 1) 2(µ + L)2d2 0 µρ2 , and hence (49) is proved. Published in Transactions on Machine Learning Research (06/2023) E Approximate implementations of ASF The implementation of RGO in Algorithm 2 is exact, hence the samples of both RGO and ASF are unbiased. 
In contrast, it is shown in Theorem 1 of Wibisono (2019) (resp., Theorem 2 of Vempala & Wibisono (2019)) that the proximal Langevin algorithm (PLA) (resp., LMC) is biased. From the perspective of proximal sampling, we give an explanation of this fact by showing that PLA and its equivalent method, namely, proximal Langevin Monte Carlo (PLMC) of Bernton (2018), can be viewed as an instance of ASF whose implementation of RGO is inexact. Assume the potential function f is convex and smooth. Recall that PLMC iteratively generates samples as follows: given yk Rd, then the next sample yk+1 takes the form of yk+1 = proxηf(yk) + 2ηz where z N(0, I). It is easy to verify that the following algorithm gives an equivalent form of PLMC. Algorithm 4 Proximal Langevin Monte Carlo (Bernton, 2018) 1. Sample xk exp[ 1 2η x proxηf(yk) 2] 2. Sample yk+1 πY |X(y | xk) exp[ 1 Next, we show that PLMC is an approximate implementation of ASF. Similar to ASF, PLMC also alternates between steps 1 and 2. More specifically, step 2 of PLMC plays the same role as step 1 of ASF, and step 1 of PLMC can be viewed as sampling from the proposal density exp[ h1(x)] without rejection, where h1(x) := f η yk(proxηf(yk)) + 1 2η x proxηf(yk) 2 f η yk(x). Since f η yk(x) is the potential function of the RGO in step 2 of ASF. Hence, step 1 of PLMC is an approximate implementation of the RGO. As a result, both PLMC and PLA are approximate implementations of ASF, and thus they are biased. Following a similar argument, we show that LMC can be viewed as an instance of ASF whose implementation of RGO is inexact. Assume f in the target distribution π exp( f) is convex and smooth and recall that the iterative step in LMC can be described as yk+1 = yk η f(yk) + p 2ηz, z N(0, I). (50) We claim that the following algorithm gives an equivalent form of LMC (50) from the proximal sampling perspective. Algorithm 5 Langevin Monte Carlo 1. Sample yk πY |X(y | xk) exp[ 1 2η xk y 2] 2. Sample xk+1 exp[ 1 2η x yk + η f(yk) 2] Indeed, steps 1 and 2 can be equivalently written as xk+1 = yk η f(yk) + ηzk, zk N(0, I), yk+1 = xk+1 + ηz k, z k N(0, I), where yk+1 is the sample from step 1 in the next iteration. Combining the above identities, we have yk+1 = yk η f(yk) + η(zk + z k) d= yk η f(yk) + p 2ηz, z N(0, I). Moreover, LMC and ASF share the same step 1, and step 2 of LMC equivalently generates xk+1 from exp[ h1(x)] where h1(x) := f(yk) + f(yk), x yk + 1 2η x yk 2. (51) Using the definition of h1 in (51) and the convexity of f, we have h1(x) f(x) + 1 2η x yk 2 = f η yk(x). Published in Transactions on Machine Learning Research (06/2023) Note that f η yk(x) is the potential function of the RGO in step 2 of ASF. Hence, step 2 of LMC can be interpreted as an RGO implementation with the proposal density exp[ h1(x)] but without rejection. As a result, LMC is an approximate implementation of ASF and thus LMC is biased. It is worth noting that many other sampling algorithms, for example, symmetric Langevin algorithm of Wibisono (2018) and Metropolis-adjusted proximal gradient Langevin dynamics of Mou et al. (2022a), can also be shown to be approximate implementations of ASF in an analogous way.
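To illustrate the equivalence discussed above, here is a minimal sketch (ours, for a toy standard-Gaussian target) of the LMC update (50) written both in its usual form and as the two half-steps of Algorithm 5; the two produce identically distributed iterates:

```python
import numpy as np

def lmc_step(y, grad_f, eta, rng):
    # Usual LMC update (50): y_{k+1} = y_k - eta * grad f(y_k) + sqrt(2*eta) * z.
    return y - eta * grad_f(y) + np.sqrt(2 * eta) * rng.standard_normal(y.shape)

def lmc_step_via_asf(y, grad_f, eta, rng):
    # The same update written as the two half-steps of Algorithm 5:
    # an inexact RGO step followed by the Gaussian step of the ASF.
    x = y - eta * grad_f(y) + np.sqrt(eta) * rng.standard_normal(y.shape)  # step 2 (inexact RGO)
    return x + np.sqrt(eta) * rng.standard_normal(y.shape)                 # step 1 of the next iteration

rng = np.random.default_rng(0)
grad_f = lambda x: x          # f(x) = ||x||^2 / 2, i.e., a standard Gaussian target
y = np.zeros(2)
for _ in range(1_000):
    y = lmc_step_via_asf(y, grad_f, eta=0.1, rng=rng)
print(y)                       # approximately a draw from N(0, I), up to the LMC bias
```

The sum of the two independent N(0, ηI) perturbations has the same law as the single √(2η)·N(0, I) term in (50), which is exactly the identity used in the equivalence argument; the bias of LMC then stems from the proposal exp(−h_1) replacing the exact RGO density without a rejection step.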