# Sampling from Log-Concave Distributions with Infinity-Distance Guarantees

Oren Mangoubi (Worcester Polytechnic Institute) and Nisheeth K. Vishnoi (Yale University)

For a d-dimensional log-concave distribution π(θ) ∝ e^{−f(θ)} constrained to a convex body K, the problem of outputting samples from a distribution ν which is ε-close in infinity-distance sup_{θ∈K} |log(ν(θ)/π(θ))| to π arises in differentially private optimization. While sampling within total-variation distance ε of π can be done by algorithms whose runtime depends polylogarithmically on 1/ε, prior algorithms for sampling within ε infinity-distance have runtime bounds that depend polynomially on 1/ε. We bridge this gap by presenting an algorithm that outputs a point ε-close to π in infinity-distance and requires at most poly(log(1/ε), d) calls to a membership oracle for K and an evaluation oracle for f, when f is Lipschitz. Our approach departs from prior works, which construct Markov chains on a (1/ε^2)-discretization of K to achieve a sample with ε infinity-distance error; instead, we present a method to directly convert continuous samples from K with total-variation bounds into samples with infinity-distance bounds. This approach also allows us to improve the dependence on the dimension d in the running time for sampling from a log-concave distribution on a polytope K with infinity-distance error ε, by plugging in TV-distance running-time bounds for the Dikin walk Markov chain.

1 Introduction

The problem of sampling from a log-concave distribution is as follows: for a convex body K ⊆ R^d and a convex function f : K → R, output a sample from the distribution π(θ) ∝ e^{−f(θ)}. This is a basic problem in computer science, statistics, and machine learning, with applications to optimization and integration [1, 29], Bayesian statistics [39], reinforcement learning [6], and differential privacy [32, 20, 2, 24]. Sampling exactly from π is known to be computationally hard for most interesting cases of K and f [16]; hence, the goal is to output samples from a distribution that is at a small (specified) distance to π. For applications such as computing the integral of π, bounds in the total variation (TV) distance [1] or KL divergence (which implies a TV bound) are sufficient. In applications such as computing the expectation of a Lipschitz function with respect to π, Wasserstein distance may also be sufficient. In differentially private optimization [32, 20, 2, 19, 24], one requires bounds on the stronger infinity-distance d_∞(ν, π) := sup_{θ∈K} |log(ν(θ)/π(θ))| to guarantee pure differential privacy, and TV, KL, or Wasserstein bounds are insufficient; see [14]. Pure differential privacy (DP) is the strongest notion of DP and has been extensively studied (see e.g. the survey [14]). It has advantages over weaker notions of differential privacy: for example, when privacy of groups of individuals (rather than just single individuals) must be preserved, any mechanism which is (pure) ε-DP with respect to single individuals is also kε-DP with respect to subsets of k individuals. Motivated by applications to differential privacy, we study the problem of designing efficient algorithms to output samples from a distribution which is ε-close in d_∞ to π.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Related works. Several lines of work have designed Markov chains that generate samples from distributions that are close to a given log-concave distribution.
These results differ both in their assumptions on the log-density and its support, and in the distance used to measure closeness. One line of work gives bounds for sampling from a log-concave distribution on a compactly supported convex body within TV distance O(δ), including results with running time that is polylogarithmic in 1/δ [1, 30, 29, 34] (as well as other results whose running-time bounds are polynomial in 1/δ [18, 17, 5, 4]). In addition to assuming access to a value oracle for f, some Markov chains only need access to a membership oracle for K [1, 30, 29], while others assume that K is a given polytope {θ ∈ R^d : Aθ ≤ b} [23, 33, 35, 34, 26]. They often also assume that K is contained in a ball of radius R and contains a ball of smaller radius r. Many of these results assume that the target log-concave distribution satisfies a well-roundedness condition, which says that the variance of the target distribution is Θ(d) [30, 29], or that it is in isotropic position (all eigenvalues of its covariance matrix are Θ(1)) [25]; when applied to log-concave distributions that are not well-rounded or isotropic, these results require a rounding pre-processing procedure to find a linear transformation which makes the target distribution well-rounded or isotropic. Finally, it is often assumed that f is L-Lipschitz or β-smooth [34], including works handling the widely studied special case where π is uniform on K, for which L = β = 0 (see e.g. [28, 23, 33, 35, 26, 8, 21]).

Another line of work gives sampling algorithms with bounds on the distance to the target density π in terms of Wasserstein distance [13, 11], KL divergence [40, 12], and Rényi divergence [36]. In contrast to works which assume access to an oracle for the value of f, many of these results instead assume access to an oracle for the gradient of f and require the log-density to be L-Lipschitz or β-smooth on all of R^d (or on, e.g., a cube containing K) for some L, β > 0. However, as noted earlier, bounds in the Wasserstein distance, KL divergence, and α-Rényi divergence (for α < ∞) do not imply bounds on the infinity-distance, and the running-time bounds provided by these works are polynomial in 1/ε. (See also Appendix A for additional discussion and challenges.) Among prior works that give algorithms with bounds on d_∞, [20] applies the grid walk Markov chain of [1] to sample from a uniform distribution on a convex body, and [2] extends the approach of [20] to log-Lipschitz log-concave distributions. Unlike the TV-distance setting, where algorithms whose running time depends logarithmically on the error are known (e.g., [1, 29, 34]), the best available bounds for sampling within O(ε) infinity-distance [20, 2] have runtime that is polynomial in 1/ε and a relatively large polynomial in d.

Our contributions. We present a new approach to output samples, which come with d_∞ bounds, from a log-concave and log-Lipschitz distribution constrained to a convex body. Specifically, when K := {θ ∈ R^d : Aθ ≤ b} is a polytope (where one is given A and b), our main result (Theorem 2.1) guarantees samples from a distribution that is within O(ε) error in d_∞ and whose runtime depends logarithmically on 1/ε, compared to the polynomial dependence of [2].
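Before stating the main result, the following small numerical sketch (illustrative only, not from the paper; it mirrors the counterexample in footnote 2 below) shows why TV-closeness is too weak for pure differential privacy: two densities on [0, 1] can be arbitrarily close in TV while their infinity-distance, and even their KL divergence, stays bounded away from zero.

```python
import numpy as np

# Illustrative example: pi is the uniform density on [0,1]; nu equals e^M on a
# tiny interval [0, w] and a normalizing constant c on (w, 1].
def tv_kl_dinf(M, mass):
    w = mass * np.exp(-M)            # interval length so that nu carries `mass` on [0, w]
    c = (1.0 - mass) / (1.0 - w)     # value of nu on (w, 1] so that nu integrates to 1
    tv = mass - w                                        # total-variation distance to uniform
    kl = mass * M + (1.0 - mass) * np.log(c)             # KL(nu || pi)
    d_inf = max(M, abs(np.log(c)))                       # sup over [0,1] of |log(nu/pi)|
    return tv, kl, d_inf

for delta in [1e-1, 1e-2, 1e-3]:
    tv, kl, d_inf = tv_kl_dinf(M=1.0 / delta, mass=delta)
    print(f"delta={delta:.0e}: TV={tv:.4f}  KL={kl:.3f}  d_inf={d_inf:.1f}")
    # TV shrinks like delta, while KL stays near 1 and d_inf grows like 1/delta.
```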
Our approach departs from prior works that construct Markov chains on a (1/ε^2)-discretization of K to achieve a sample with ε infinity-distance error: we present a method (Algorithm 1) to directly convert continuous samples from K with total-variation bounds into samples with infinity-distance bounds (Theorem 2.2). This continuous-space approach also allows us to improve the dependence on the dimension d in the running time when K is a polytope, by plugging in TV-distance running-time bounds for the Dikin walk Markov chain of [34]. As immediate applications, we obtain faster algorithms for differentially private empirical risk minimization (Corollary 2.4) and low-rank approximation (Corollary 2.5). Let B(v, s) := {z ∈ R^d : ‖z − v‖_2 ≤ s}, and let ω denote the matrix-multiplication constant.

Theorem 2.1 (Main result). There exists an algorithm which, given ε, L, r, R > 0, A ∈ R^{m×d}, b ∈ R^m that define a polytope K := {θ ∈ R^d : Aθ ≤ b} contained in a ball of radius R, a point a ∈ R^d such that K contains a ball B(a, r) of smaller radius r, and an oracle for the value of a convex L-Lipschitz function f : K → R, and defining π to be the distribution π ∝ e^{−f}, outputs a point from a distribution ν such that d_∞(ν, π) < ε. Moreover, with very high probability¹, this algorithm takes O(T) function evaluations and O(T·m·d^{ω−1}) arithmetic operations, where T = O((m^2 d^3 + m^2 d L^2 R^2) · [LR + d log((Rd + LRd)/(rε))]).

¹The number of steps is O(τT), where E[τ] ≤ 3, P(τ ≥ t) ≤ 2^{−t} for t ≥ 0, and τ ≤ O(d log(R/r) + LR) with probability 1.

In comparison to the polynomial-in-1/ε runtime bounds of [2], Theorem 2.1 guarantees a runtime that is logarithmic in 1/ε, and also improves the dependence on the dimension d, in the setting where K is a polytope. Specifically, [2] show that the number of steps of the grid walk needed to sample from π with infinity-distance error at most ε is (1/ε^2)(d^{10} + d^6 L^4 R^4) · polylog(1/r, R, L, d) (Lemma 6.5 in the arXiv version of [2]). When applying their algorithm in the setting where f is constrained to a polytope K = {x ∈ R^d : Ax ≤ b}, each step of their grid-walk Markov chain requires computing a membership oracle for K and the value of the function f. The membership oracle can be computed in O(md) arithmetic operations, so each step of their grid walk takes O(md) arithmetic operations (provided that each function evaluation takes at most O(md) arithmetic operations). Thus the bound on the number of arithmetic operations to obtain a sample from π is O((1/ε^2)(m d^{11} + m d^7 L^4 R^4) · polylog(1/r, R, L, d)). Theorem 2.1 therefore improves on this bound by a factor of roughly (1/ε^2) · d^{8−ω}/m^2; for example, when m = O(d), as may be the case in differentially private applications, the improvement is roughly (1/ε^2) · d^{6−ω}.

We note that the bounds of [2] also apply in the more general setting where K is a convex body given by a membership oracle. One can extend our bounds to achieve a runtime that is logarithmic in 1/ε (and polynomial in d, L, R) in this more general setting as well; we omit the details (see Remark 2.3). Moreover, while there are several results which achieve O(δ) TV bounds in time logarithmic in 1/δ, an O(δ) TV bound does not in general imply an O(ε) bound on the KL or Rényi divergence, or on the infinity-distance, for any δ > 0.² On the other hand, an ε infinity-distance bound immediately implies a bound of ε on the KL divergence D_KL and on the α-Rényi divergence D_α, since D_KL(µ, π) ≤ d_∞(µ, π) and D_α(µ, π) ≤ d_∞(µ, π) for any α > 0 and any pair of distributions µ, π.

²For instance, if π(θ) = 1 is the uniform distribution on [0, 1], then for every δ > 0 there is a distribution ν with ‖ν − π‖_TV ≤ 2δ and yet d_∞(ν, π) ≥ D_KL(ν, π) ≥ 1/2 (for example, a ν that takes the value e^{1/(2δ)} on an interval of length 2δe^{−1/(2δ)} and a suitable normalizing constant elsewhere).
Thus, under the same assumptions on K and f, Theorem 2.1 also yields a method of sampling from a Lipschitz concave log-density on K with ε error in KL divergence and Rényi divergence, with the same bound on the number of arithmetic operations; in particular, the number of arithmetic operations is logarithmic in 1/ε.

The polynomial dependence on 1/ε in [2] is due to the fact that they rely on a discrete-space Markov chain [1], run on a grid with cells of width w = O(ε/(L√d)), to sample from π within O(ε) infinity-distance. Since their Markov chain's runtime bound is polynomial in w^{−1}, they get a runtime bound for sampling within O(ε) infinity-distance that is polynomial in 1/ε. The proof of Theorem 2.1 bypasses the use of discrete grid-based Markov chains by introducing Algorithm 1, which transforms any sample within δ = O(ε (r/R)^d e^{−LR}) TV distance of the distribution π ∝ e^{−f} on the continuous set K (as opposed to a discretization of K) into a sample within O(ε) infinity-distance from π. This allows us to make use of a continuous-space Markov chain, whose step size is not restricted to a grid of width O(ε/(L√d)) and is instead independent of ε, to obtain a sample within O(ε) infinity-distance from π in time that is logarithmic in 1/ε.

Theorem 2.2 (Main technical contribution: Converting TV bounds to infinity-distance bounds). There exists an algorithm (Algorithm 1) which, given ε, r, R, L > 0, a membership oracle for a convex body K contained in a ball of radius R and containing a ball B(0, r), and an oracle which outputs a point from a distribution µ which is within TV distance δ of a distribution π ∝ e^{−f}, where f : K → R is an L-Lipschitz function and δ is of order ε (r/R)^d e^{−LR} / (3R(d log(R/r) + LR)^2) (see Appendix B for the exact values of δ and the related hyperparameters), outputs a point θ̂ ∈ K such that the distribution ρ of θ̂ satisfies d_∞(ρ, π) ≤ ε. Moreover, with very high probability³, this algorithm finishes in O(1) calls to the sampling and membership oracles, plus O(d) arithmetic operations.

³Algorithm 1 finishes in τ calls to the sampling and membership oracles, plus O(τd) arithmetic operations, where E[τ] ≤ 3, P(τ ≥ t) ≤ 2^{−t} for all t ≥ 0, and τ ≤ 5d log(R/r) + 5LR + log(2/ε) with probability 1.

To the best of our knowledge, Theorem 2.2 is the first result which, for any ε, L, r, R > 0, when provided as input a sample from a continuous-space distribution on a convex body K within some TV distance δ = δ(ε, r, R, L) > 0 of a given L-log-Lipschitz distribution π on K, where K is contained in a ball of radius R and contains a ball of smaller radius r, outputs a sample whose distribution is within infinity-distance O(ε) of π. This is in contrast to previous works [2] (see also [20], which applies only to the special case where π is the uniform distribution on K), which instead require as input a sample with bounded TV distance from the restriction of π to a discrete grid on K, and then convert this discrete-space sample into a sample within infinity-distance O(ε) of the continuous-space distribution π : K → R.

Algorithm 1: Interior-point TV to infinity-distance converter
Input: d ∈ N
Input: A membership oracle for a convex body K ⊆ R^d and an r > 0 such that B(0, r) ⊆ K.
Input: A sampling oracle which outputs a point from a distribution µ : K → R.
Output: A point θ̂ ∈ K.
1  Hyperparameters: α > 0, i_max ∈ N (set in Appendix B)
2  for i = 1, . . . , i_max do
3    Sample a point θ ∼ µ
4    Sample a point ξ ∼ Unif(B(0, 1))
5    Set Z ← θ + αrξ
6    Set θ̂ ← (1/(1 − α)) Z
7    If θ̂ ∈ K, output θ̂ with probability 1 − α/2 and halt. Otherwise, continue.
8  end for
9  Sample a point θ̂ ∼ Unif(B(0, r))
10 Output θ̂
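The following Python sketch illustrates the structure of Algorithm 1. It is a minimal illustration, not the authors' implementation: `sample_mu` (the TV-accurate sampling oracle), `membership` (the membership oracle for K), and the hyperparameters `alpha` and `i_max` are assumed to be supplied by the caller, with their values set as in Appendix B.

```python
import numpy as np

def sample_unit_ball(d, rng):
    """Sample uniformly from the d-dimensional unit ball B(0, 1)."""
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)
    return x * rng.random() ** (1.0 / d)

def tv_to_infinity_converter(d, r, alpha, i_max, sample_mu, membership, rng=None):
    """Sketch of Algorithm 1: convert a TV-accurate sampler into a sample whose
    distribution is close to pi in infinity-distance.

    sample_mu():    returns a point of K drawn from mu, with ||mu - pi||_TV <= delta
    membership(x):  returns True iff x lies in the convex body K
    alpha, i_max:   hyperparameters (set as in Appendix B of the paper)
    """
    rng = rng or np.random.default_rng()
    for _ in range(i_max):
        theta = sample_mu()                          # step 3: draw from the TV-accurate sampler
        xi = sample_unit_ball(d, rng)                # step 4: uniform noise direction
        z = theta + alpha * r * xi                   # step 5: smooth by a ball of radius alpha*r
        theta_hat = z / (1.0 - alpha)                # step 6: stretch by a factor 1/(1 - alpha)
        if membership(theta_hat) and rng.random() < 1.0 - alpha / 2:
            return theta_hat                         # step 7: accept (probability 1 - alpha/2)
    # steps 9-10: fallback, reached only with very small probability
    return r * sample_unit_ball(d, rng)
```

The fallback in the last line corresponds to steps 9–10 of the pseudocode: after i_max failed iterations the algorithm outputs a uniform point from the ball B(0, r) ⊆ K, which (as argued in Section 3.1) perturbs the infinity-distance guarantee by at most ε.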
Remark 2.3 (Extension to convex bodies with membership oracles). We note that Theorem 2.1 can be extended to the general setting where K is an arbitrary convex body contained in a ball of radius R and containing a ball of smaller radius r, and we only have membership-oracle access to K. Namely, one can plug the results of [29] into our Theorem 2.2 to generate a sample from an L-Lipschitz concave log-density on an arbitrary convex body K in a number of operations that is poly-logarithmic in 1/ε and 1/r, and polynomial in d, L, R. We omit the details.

Applications to differentially private optimization. Sampling from distributions with O(ε) infinity-distance error has many applications to differential privacy. Here, the goal is to find a randomized mechanism h which, given a dataset x ∈ D^n consisting of n datapoints, outputs model parameters θ̂ in some parameter space, which (approximately) minimize a given (negative) utility function f(θ, x), under the constraint that the output θ̂ preserves the pure ε-differential privacy of the datapoints x. A randomized mechanism h is said to be ε-differentially private if, for any datasets x, x′ ∈ D^n which differ in a single datapoint and any set S of parameters, P(h(x) ∈ S) ≤ e^ε P(h(x′) ∈ S).

As one application of Theorem 2.1, we consider the problem of finding an (approximate) minimizer θ̂ of an empirical risk function f : K × D^n → R under the constraint that the output θ̂ is ε-differentially private, where f(θ, x) := Σ_{i=1}^n ℓ_i(θ, x_i). Following [2], we assume that the ℓ_i(·, x) are L-Lipschitz for all x ∈ D^n and all i, for some given L > 0. In this setting, [2] show that the minimum ERM utility bound under the constraint that θ̂ is pure ε-differentially private, E_{θ̂}[f(θ̂, x)] − min_{θ∈K} f(θ, x) = Θ(dLR/ε), is achieved if one samples θ̂ from the exponential mechanism π ∝ e^{−(ε/(2LR)) f} with infinity-distance error at most O(ε). Plugging Theorem 2.1 into the framework of the exponential mechanism, we obtain a pure ε-differentially private mechanism which achieves the minimum expected risk (Corollary 2.4; see Section C.1 for a proof).

Corollary 2.4 (Differentially private empirical risk minimization). There exists an algorithm which, given ε, L, r, R > 0, A ∈ R^{m×d}, b ∈ R^m that define a polytope K := {θ ∈ R^d : Aθ ≤ b} contained in a ball of radius R and containing a ball B(0, r) of smaller radius r, and a convex function f(θ, x) := Σ_{i=1}^n ℓ_i(θ, x_i), where each ℓ_i(·, x_i) : K → R is L-Lipschitz, outputs a random point θ̂ ∈ K which is pure ε-differentially private and satisfies E_{θ̂}[f(θ̂, x)] − min_{θ∈K} f(θ, x) ≤ O(dLR/ε). Moreover, this algorithm takes at most T·m·d^{ω−1} arithmetic operations plus T evaluations of the function f, where T = O((m^2 d^3 + m^2 d n^2 ε^2) · [εn + d log(nRd/(rε))]).

Corollary 2.4 improves on the previous bound of [2], namely O(((1/ε^2)(m + n) d^{11} + ε^2 n^4 (m + n) d^7) · polylog(nRd/(rε))) arithmetic operations, by a factor of roughly max{d^{8−ω}/(ε^2 m^2), n d^{7−ω}/(ε m^2)}, in the setting where the ℓ_i are L-Lipschitz on a polytope K and each ℓ_i can be evaluated in O(d) operations. See Appendix C.1 for a proof of this corollary.
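As a usage sketch (not the authors' code), the exponential-mechanism recipe behind Corollary 2.4 can be phrased as follows: rescale the empirical risk so that its negative exponential is the target density, draw one sample with small infinity-distance error, and return it. The function `sample_inf_close`, which stands in for the sampler of Theorem 2.1, and the loss interface are assumptions made for illustration only; the exact constants in the privacy accounting are those of [2] and Appendix C.1.

```python
def dp_erm_exponential_mechanism(losses, data, L, R, eps, sample_inf_close):
    """Hypothetical sketch of the exponential mechanism for DP-ERM (Corollary 2.4).

    losses:            per-datapoint loss functions l_i(theta, x_i), each L-Lipschitz in theta
    data:              the dataset x = (x_1, ..., x_n)
    L, R:              Lipschitz constant of each l_i and radius of a ball containing K
    eps:               pure differential-privacy parameter
    sample_inf_close:  oracle taking a log-density g and an accuracy, and returning a sample
                       within that infinity-distance of pi proportional to exp(-g); this
                       stands in for the sampler of Theorem 2.1.
    """
    def empirical_risk(theta):
        return sum(l(theta, x) for l, x in zip(losses, data))

    # Target density of the exponential mechanism: pi(theta) ~ exp(-(eps / (2 L R)) * f(theta, x)).
    def scaled_risk(theta):
        return (eps / (2.0 * L * R)) * empirical_risk(theta)

    # Sampling with infinity-distance error O(eps) keeps the overall mechanism pure O(eps)-DP.
    return sample_inf_close(scaled_risk, eps)
```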
As another application, we consider the problem of finding a low-rank approximation of a sample covariance matrix Σ = Σ_{i=1}^n u_i u_i^⊤, where the datapoints u_i ∈ R^d satisfy ‖u_i‖ ≤ 1, in a differentially private manner. Given any integer k > 0, the goal is to find a (random) rank-k projection matrix P which maximizes the expected captured variance E_P[⟨Σ, P⟩] (also referred to as the utility of P), under the constraint that the mechanism which outputs P is ε-differentially private. This problem has many applications to statistics and machine learning, including differentially private principal component analysis (PCA) [7, 3, 15, 24]. When privacy is not a concern, the solution P which maximizes the variance is simply the projection matrix onto the subspace spanned by the top-k eigenvectors of Σ, and the maximum variance satisfies ⟨Σ, P⟩ = Σ_{i=1}^k λ_i, where λ_1 ≥ ··· ≥ λ_d > 0 denote the eigenvalues of Σ. However, when privacy is a concern, there is a tradeoff between the desired privacy level ε and the utility E_P[⟨Σ, P⟩], and the maximum achievable utility decreases with the privacy parameter ε. The best current utility bound for an ε-differentially private low-rank approximation algorithm was achieved by [24], who show that one can find a pure ε-differentially private random rank-k projection matrix P such that E_P[⟨Σ, P⟩] ≥ (1 − γ) Σ_{i=1}^k λ_i whenever Σ_{i=1}^k λ_i ≥ C dk/(γε), for any γ > 0. To generate the matrix P, their algorithm generates a sample, with infinity-distance error O(ε), from a Lipschitz concave log-density on a polytope, and transforms this sample into a projection matrix P. The sampling algorithm used in [24] has a bound of poly(1/ε, d, 1/λ_d) arithmetic operations, and they leave as an open problem whether the polynomial dependence on 1/ε can be improved to a logarithmic one. Corollary 2.5 shows that a direct application of Theorem 2.1 resolves this problem. (See Section C.2 for a proof.)

Corollary 2.5 (Differentially private low-rank approximation). There exists an algorithm which, given a sample covariance matrix Σ = Σ_{i=1}^n u_i u_i^⊤ for datapoints u_i ∈ R^d satisfying ‖u_i‖ ≤ 1, its eigenvalues λ_1 ≥ ··· ≥ λ_d > 0, an integer k, and ε, γ > 0, outputs a random rank-k symmetric projection matrix P such that P is ε-differentially private and satisfies the utility bound E_P[⟨Σ, P⟩] ≥ (1 − γ) Σ_{i=1}^k λ_i whenever Σ_{i=1}^k λ_i(Σ) ≥ C dk/(γε) for some universal constant C > 0. Moreover, the number of arithmetic operations is logarithmic in 1/ε and polynomial in d and 1/λ_d.

3 Proof Overviews

Given any ε > 0 and a function f : R^d → R, the goal is to sample from the distribution π(θ) ∝ e^{−f(θ)} constrained to a d-dimensional convex body K, with infinity-distance error at most O(ε), in a number of arithmetic operations that is logarithmic in 1/ε. We assume that K is contained in a ball of some radius R > 0 and contains a ball of some radius r > 0, and that f is L-Lipschitz. In addition, we would also like our bounds to be polylogarithmic in 1/r and polynomial in d, L, R, with a lower-order dependence on the dimension d than currently available bounds for sampling from Lipschitz concave log-densities on a polytope in infinity-distance [2]. We note that whenever K is contained in a ball of radius R and contains a ball B(0, r) of smaller radius r, we also have B(0, r) ⊆ K ⊆ B(0, 2R); hence, without loss of generality, we may assume that B(0, r) ⊆ K ⊆ B(0, R), as this changes the bounds in our main theorems by at most a constant factor. The main ingredient in the proof of Theorem 2.1 is Theorem 2.2, which uses Algorithm 1 to transform a TV-bounded sample into a sample from π with error bounded in d_∞.
Subsequently, to prove Theorem 2.1 when K is given as a polytope K := {x ∈ R^d : Ax ≤ b}, we plug in the Dikin walk Markov chain of [34], which generates independent samples from π with bounded TV error. We first present an overview of the proof of Theorem 2.2. (The full proof is omitted due to space restrictions and presented in Appendix B.) The proof of Theorem 2.1 is presented in Section 3.2.

3.1 Converting samples with TV bounds to infinity-distance bounds; proof of Theorem 2.2

Impossibility of obtaining log-dependence on infinity-distance via the grid walk. One approach is to observe that if e^{−f} has support on a discrete space S with at most |S| points, then any ν such that ‖ν − π‖_TV ≤ ε also satisfies d_∞(ν, π) ≤ 2|S| · (max_{z∈S} e^{−f(z)} / min_{z∈S} e^{−f(z)}) · ε, for any ε ≤ min_{z∈S} π(z). This suggests forming a grid G over K, using a discrete Markov chain to generate a sample within O(ε) TV distance of the discrete distribution π_G ∝ e^{−f} with support on the grid G, and then designing an algorithm which takes this sample as input and outputs a point with bounded infinity-distance to the continuous distribution π. This approach was used in [20] in the special case where π is uniform on K, and then extended by [2] to log-Lipschitz log-concave distributions. In their approach, [2] first run a grid-walk Markov chain on a discrete grid in a cube containing K. They then apply the bound from [1], which says that the grid walk obtains a sample Z within TV distance O(δ) of the distribution ∝ e^{−f} (restricted to the grid) in time that is polylogarithmic in 1/δ and quadratic in a^{−1}, where a is the distance between neighboring grid points. Since their grid has size |S| = Θ((R/a)^d), a TV distance of O(δ) automatically implies an infinity-distance of O(δ c |S|), where c is the ratio of the maximum to the minimum probability mass and satisfies c ≤ e^{LR} since f is L-Lipschitz on K ⊆ B(0, R). Thus, by using the grid walk to sample within TV distance δ = O(ε e^{−LR} (a/R)^d) of the discrete distribution π_G, they obtain a sample Z which also has infinity-distance O(ε) from π_G. Finally, to obtain a sample from the distribution π ∝ e^{−f} on the continuous space K, they sample a point uniformly from the grid cell [Z − a, Z + a]^d centered at Z. Since f is L-Lipschitz, the ratio e^{−f(Z)}/e^{−f(w)} is within e^{±O(ε)} for all w in the grid cell [Z − a, Z + a]^d as long as a = O(ε/(L√d)), implying that the resulting sample is within infinity-distance O(ε) of π ∝ e^{−f}. However, since the running-time bound of the grid walk is quadratic in a^{−1}, the grid coarseness a = O(ε/(L√d)) needed to achieve O(ε) infinity-distance from π leads to a running-time bound which is quadratic in 1/ε.

To get around this problem, rather than relying on a discrete-space Markov chain such as the grid walk to sample within O(ε) infinity-distance from π, we introduce an algorithm (Algorithm 1) which transforms any sample within δ = O(ε (r/R)^d e^{−LR}) TV distance of the distribution π ∝ e^{−f} on the continuous space K (as opposed to a grid-based discretization of K) into a sample within O(ε) infinity-distance from π. This allows us to make use of a continuous-space Markov chain, such as the Dikin walk of [34], whose step size is not restricted by a grid of coarseness O(ε/(L√d)) and is instead independent of ε, in order to generate a sample within O(ε) infinity-distance from π in runtime that is logarithmic in 1/ε.
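To spell out the scaling just described (a back-of-the-envelope restatement of the discussion above, not a new bound): a grid cell has diameter a√d, so keeping the variation of the log-density within a cell of order ε forces the grid spacing a to shrink with ε, and a runtime quadratic in 1/a then picks up a 1/ε² factor.

```latex
% Grid-walk scaling sketch (restating the discussion above):
\[
  L \cdot a\sqrt{d} = O(\varepsilon)
  \;\Longrightarrow\; a = O\!\Big(\tfrac{\varepsilon}{L\sqrt{d}}\Big),
  \qquad
  |S| = \Theta\!\Big(\big(\tfrac{R}{a}\big)^{d}\Big),
  \qquad
  \text{runtime} = \Omega\!\big(a^{-2}\big) = \Omega\!\Big(\tfrac{L^{2} d}{\varepsilon^{2}}\Big).
\]
```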
Converting continuous-space TV-bounded samples to infinity-distance-bounded samples. As discussed in Section 1, there are many Markov chain results which allow one to sample from a log-concave distribution on K with error bounded in weaker metrics such as total variation, Wasserstein distance, or KL divergence. However, when sampling from a continuous distribution, bounds in these metrics do not directly imply bounds in infinity-distance, and the techniques used to prove bounds in weaker metrics do not easily extend to methods for bounding the infinity-distance; see Section A.1.

Convolving with continuous noise. As a first attempt, we consider the following simple algorithm: sample a point θ_µ from a distribution µ with total-variation error ‖µ − π‖_TV ≤ O(ε). Since f is L-Lipschitz, for any η < ε/L and any ball B(z, η) in the η-interior of K (denoted by int_η(K); see Definition B.1), we can obtain a sample from a distribution ν such that |log(ν(z)/π(z))| ≤ ε for all z ∈ int_η(K) by convolving µ with the uniform distribution on the ball B(0, η). Sampling from this distribution can be achieved by first sampling θ_µ ∼ µ and then adding noise ξ ∼ Unif(B(0, η)) to the sample θ_µ. Unfortunately, this simple algorithm does not allow us to guarantee that |log(ν(z)/π(z))| ≤ ε at points z ∉ int_η(K), which are at distance less than η from the boundary of K. To see why, suppose that K = [0, 1]^d is the unit cube, that f is constant on K, and consider the point w = (1, . . . , 1) at a corner of the cube K. In this case we could have ν(z) ≤ 2^{−d} π(z) for all z in some ball containing w, and hence d_∞(ν, π) = sup_z |log(ν(z)/π(z))| ≥ d log 2, no matter how small we make η.

Stretching the convex body to handle points close to the boundary. To get around this problem, we would like to design an algorithm which samples from some distribution ν such that |log(ν(z)/π(z))| ≤ ε for all z ∈ K, including at points z near the corners of K. Towards this end, we first consider the special case where K is itself contained in the η-interior of another convex body K′, the function f : K → R extends to an L-Lipschitz function on K′ (also referred to, with slight abuse of notation, as f), and we are able to sample from the distribution ∝ e^{−f} on K′ with O(ε) total-variation error. If we sample from ∝ e^{−f} on K′ with total-variation error O(δ) for small enough δ (roughly, δ ≤ ε e^{−d log(R/r) − LR}), add noise ξ ∼ Unif(B(0, η)) to the sample, and then reject the perturbed point only if it is not in K, we obtain a sample whose distribution is within O(ε) infinity-distance of the distribution ∝ e^{−f} on K. However, we would still need to define and sample from such a convex body K′, and to make sure that K′ is not too large compared to K; otherwise the samples from the distribution ∝ e^{−f} on K′ may be rejected with high probability. Moreover, another issue we need to deal with is that f may not even be defined outside of K.

Figure 1: The construction used in the proof of Lemma B.1.

To get around these two problems, in Algorithm 1 we begin by taking as input a point θ_µ sampled from some distribution µ supported on K with ‖µ − π‖_TV ≤ δ, for some δ of order ε (r/R)^d e^{−LR} (see Appendix B), and add noise ξ uniformly distributed on B(0, αr) in order to sample from a distribution µ̂ which satisfies |log(µ̂(z)/π(z))| ≤ ε for all z ∈ int_{αr}(K). Here r is the radius of the small ball contained in K; the choice of radius αr for the noise distribution is made because, as we show in the following paragraphs, to sample from the distribution π on K with infinity-distance error ε it suffices to sample a point in the αr-interior of K and to then apply a stretching operation to K.
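The corner effect described above is easy to see numerically. The following sketch (illustrative only, not from the paper) estimates, by Monte Carlo, the density at a corner of the unit cube after convolving the uniform distribution on the cube with uniform noise on a small ball: only about a 2^{−d} fraction of the noise ball lies inside the cube, so the convolved density at the corner is roughly 2^{−d} times the uniform density.

```python
import numpy as np

def convolved_density_ratio_at_corner(d, eta=0.05, n_samples=200_000, seed=0):
    """Estimate nu(w)/pi(w) at the corner w = (1,...,1) of K = [0,1]^d, where nu is
    the uniform density on K convolved with Unif(B(0, eta)).

    nu(w) equals pi(w) times the fraction of the ball B(w, eta) that lies inside K,
    i.e. the fraction of offsets with all coordinates <= 0, which is 2^(-d)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    radii = rng.random(n_samples) ** (1.0 / d)
    offsets = eta * x * radii[:, None]               # uniform samples from B(0, eta)
    inside = np.all(offsets <= 0.0, axis=1)          # offsets keeping w + offset inside the cube
    return inside.mean()

for d in [2, 4, 8]:
    ratio = convolved_density_ratio_at_corner(d)
    print(f"d={d}: estimated nu(w)/pi(w) ~ {ratio:.4f}  (2^-d = {2.0 ** (-d):.4f})")
```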
We still need a method of sampling within O(ε) infinity-distance error of π on all of K, including in the region K \ int_{αr}(K) near the boundary of K. Towards this end, after Algorithm 1 generates a point Z = θ_µ + ξ from the above-mentioned distribution µ̂, it multiplies Z by 1/(1 − α) and returns the resulting point θ̂ := Z/(1 − α) if it lies in K, in other words, if Z ∈ (1 − α)K. If we can show that (1 − α)K ⊆ int_{αr}(K), then this would imply that |log(µ̂(z)/π(z))| ≤ ε for all z ∈ (1 − α)K, and hence that the distribution ρ of the returned point θ̂ satisfies

|log(ρ(z)/π((1 − α)z))| ≤ |log(µ̂((1 − α)z)/π((1 − α)z))| + log(1/(1 − α)^d) ≤ O(ε) for all z ∈ K.

Since f is L-Lipschitz we have |log(π(θ)/π((1 − α)θ))| = O(ε) for all θ ∈ K, and hence we would then have that the distribution ρ of the point θ̂ returned by Algorithm 1 satisfies

|log(ρ(z)/π(z))| ≤ |log(ρ(z)/π((1 − α)z))| + O(ε) ≤ O(ε) for all z ∈ K.   (1)

However, for (1) to hold, we still need to show that (1 − α)K ⊆ int_{αr}(K) (proved in Lemma B.1). In other words, we would like to show that for any point Z ∈ (1 − α)K, the ball B(Z, αr) of radius αr centered at Z is contained in K. To show this fact, it suffices to consider the convex hull C of B(0, r) ∪ {θ̂}, which is contained in K, and to show that it contains the ball B(Z, αr). Towards this end, we make the following geometric construction: we let p be a point such that the line through p and θ̂ is tangent to B(0, r) at p, and q the point on this line which minimizes the distance ‖q − Z‖_2 (see Figure 1). Since ∠0pθ̂ and ∠Zqθ̂ are both right angles, the triangles 0pθ̂ and Zqθ̂ are similar, and hence ‖Z − q‖_2 / r = ‖Z − θ̂‖_2 / ‖θ̂ − 0‖_2. In other words, ‖Z − q‖_2 = r · ‖θ̂ − (1 − α)θ̂‖_2 / ‖θ̂‖_2 = αr, implying that a ball of radius αr centered at Z is contained in C ⊆ K, and hence that (1 − α)K ⊆ int_{αr}(K).

Bounding the infinity-distance error. To complete the bound on the infinity-distance of the distribution ρ of the point returned by Algorithm 1 to the target distribution π, we must show both a lower bound (Lemma B.5) and an upper bound (Lemma B.6) on the ratio ρ(θ)/π(θ) at every point θ ∈ K. Both the upper and lower bounds are necessary to bound the infinity-distance d_∞(ρ, π) = sup_{θ∈K} |log(ρ(θ)/π(θ))|. Both Lemmas B.5 and B.6 require the input point to have TV error δ < ε (r/R)^d e^{−LR}. The term (r/R)^d reflects the ratio of the volume of the smoothing ball B(0, αr) to the volume of K; this bound holds since K is contained in a ball of radius R. The term e^{−LR} is a lower bound on the ratio min_{w∈K} π(w) / max_{w∈K} π(w) of the minimum to the maximum value of the density π at any two points of K; this bound holds since f is L-Lipschitz. The above choice of δ ensures that, in any ball B(z, αr) with center z in the αr-interior of K, the distribution µ of the input point, which satisfies ‖µ − π‖_TV ≤ δ, has between e^{−ε} and e^{ε} times the probability mass which the target distribution π has inside the ball B(z, αr). Thus, when the distribution µ is smoothed by adding noise uniformly distributed on a ball of radius αr, the smoothed density µ̂(·) is within a factor between e^{−ε} and e^{ε} of the target probability density π(·) at any point in the αr-interior of K, allowing us to bound the infinity-distance error of the smoothed distribution at any point in the αr-interior of K.
We then apply this fact, together with Lemma B.1 (which says that for any point θ ∈ K the point (1 − α)θ is in the αr-interior of K), to bound the distribution ρ of the output point (after the stretching operation) as follows:

ρ(θ) ≥ (1 − α)^d µ̂((1 − α)θ) ≥ (1 − α)^d π((1 − α)θ) e^{−ε} ≥ π(θ) e^{−O(ε)}.   (2)

Here our choice of hyperparameter α ≤ ε/max(d, LR) ensures that (1 − α)^d = Ω(1) and, since f is L-Lipschitz, that π((1 − α)θ) ≥ e^{−ε} π(θ). This proves the lower bound (Lemma B.5). The proof of the upper bound (Lemma B.6) follows in a similar way to equation (2), but with the inequalities going in the opposite direction.

Bounding the number of iterations and concluding the proof of Theorem 2.2. We still need to deal with the possibility that the point θ̂ is not accepted. If this occurs, roughly speaking, we repeat the above procedure until a point θ̂ with distribution ρ is accepted. To bound the number of iterations, we show that θ̂ is in K with high probability. Towards this end, we first use the facts that f is L-Lipschitz and K ⊆ B(0, R) to show that the probability that a point sampled from π ∝ e^{−f} lies inside (1 − α)K is at least (1 − α)^d e^{−αLR} ≥ 9/10 (Lemma B.2). Lemma B.2 says that if you stretch the polytope by a factor of 1/(1 − α), then most of the volume of the stretched polytope (1/(1 − α))K remains inside the original polytope K. The term (1 − α)^d is just the ratio of the volume of (1 − α)K to the volume of K; and, since f is L-Lipschitz, the term e^{−αLR} lower-bounds the ratio π(θ)/π(θ/(1 − α)) of the target density at any point θ ∈ K to its value at the point θ/(1 − α) to which the stretching operation transports θ, whenever θ/(1 − α) ∈ K. The choice of hyperparameter α ≤ ε/max(d, LR) ensures that the acceptance probability (1 − α)^d e^{−αLR} guaranteed by Lemma B.2 is at least 9/10. Since the convex body (1 − α)K contains the ball B(0, r/2), applying Lemma B.1 a second time (this time to the convex body (1 − α)K), we get that (1 − 3α)K ⊆ int_{αr/2}((1 − α)K). Thus, by Lemma B.2, θ_µ lies inside int_{αr/2}((1 − α)K) with probability at least 9/10 − δ ≥ 8/10 (as θ_µ is sampled from a distribution within TV distance δ of π). Therefore, since ξ ∈ B(0, αr), the probability that the point θ̂ = (θ_µ + ξ)/(1 − α) is in K (and is therefore not rejected) is greater than 3/4.⁴ This implies that the number of iterations until our algorithm returns a point θ̂ is at most k with probability at least 1 − 2^{−k} for every k > 0, and that the expected number of iterations is at most 2 (proved in Corollary B.4). Since each iteration requires one random sample from the distribution µ and one call to a membership oracle for K (to determine whether Z/(1 − α) ∈ K), the number of sampling-oracle and membership-oracle calls required by Algorithm 1 is O(1) with very high probability. Therefore, with high probability, Algorithm 1 returns a point θ̂ from a distribution with infinity-distance at most ε from π after O(1) calls to the sampling and membership oracles.

Since Algorithm 1 succeeds with probability at least 1 − 2^{−k} within k iterations, after i_max = 5d log(R/r) + 5LR + log(2/ε) iterations Algorithm 1 will have succeeded with probability roughly 1 − ε (r/R)^{5d} e^{−5LR}. In the very unlikely event that Algorithm 1 still has not succeeded after i_max iterations, it simply outputs a point sampled from the uniform distribution on the ball B(0, r) of radius r contained in K. The probability mass of the target distribution π inside this ball is at least (r/R)^d e^{−LR}; thus, since f is L-Lipschitz, we show in Corollary B.4 that outputting a sample from the uniform distribution on this ball with probability roughly ε (r/R)^{5d} e^{−5LR} does not change the d_∞ error of the sample returned by the algorithm by more than ε.

⁴In Algorithm 1 we reject θ̂ with a slightly higher probability than necessary to ensure that, in differential privacy applications, in addition to the privacy of the point returned by the algorithm, the runtime is also ε-differentially private.
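For concreteness, the following helper sketches how the hyperparameters of Algorithm 1 could be set from the quantities discussed above. The expressions mirror the values quoted in this overview (α ≤ ε/max(d, LR), a TV budget of order ε(r/R)^d e^{−LR} divided by 3R(d log(R/r) + LR)^2, and i_max ≈ 5d log(R/r) + 5LR + log(2/ε)); the exact constants used in the paper are those of Appendix B, so treat these as indicative rather than definitive.

```python
import math

def algorithm1_hyperparameters(eps, d, L, R, r):
    """Indicative hyperparameter choices for Algorithm 1; see Appendix B of the paper
    for the exact values. The formulas below follow the proof overview."""
    alpha = eps / max(d, L * R)                       # stretch / noise parameter
    # TV accuracy required of the input sampler (up to the constants of Appendix B):
    delta = (eps * (r / R) ** d * math.exp(-L * R)
             / (3.0 * R * (d * math.log(R / r) + L * R) ** 2))
    i_max = math.ceil(5 * d * math.log(R / r) + 5 * L * R + math.log(2.0 / eps))
    return alpha, delta, i_max

alpha, delta, i_max = algorithm1_hyperparameters(eps=0.1, d=5, L=1.0, R=10.0, r=1.0)
print(alpha, delta, i_max)
```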
3.2 Completing the proof of Theorem 2.1

Proof: By Theorem 2.2, the output of Algorithm 1 has infinity-distance error bounded by ε as long as the input samples have TV error bounded by δ, where δ is of order ε (r/R)^d e^{−LR} / (3R(d log(R/r) + LR)^2), and, with high probability, Algorithm 1 requires O(1) such independent samples. To generate a sample from π with TV error O(δ) when K = {θ ∈ R^d : Aθ ≤ b} is a polytope defined by m inequalities, we use the Dikin walk Markov chain of [34]. This Markov chain requires an initial point from some distribution µ_0 which is w-warm with respect to the stationary distribution π, that is, sup_{z∈K} µ_0(z)/π(z) ≤ w. To obtain a warm start, we let µ_0 be the uniform distribution on the ball of radius r contained in K and sample from µ_0. Since f is L-Lipschitz and K is contained in a ball of radius R, µ_0 is w-warm with

w ≤ (Vol(B(0, R)) / Vol(B(0, r))) · 3 · (max_{z∈K} π(z) / min_{z∈K} π(z)).

From [34], we have that from this w-warm start the Dikin walk Markov chain requires at most O((m^2 d^3 + m^2 d L^2 R^2) log(w/δ)) steps to generate a sample with TV distance at most δ from π, where each step makes one function evaluation and O(m d^{ω−1}) arithmetic operations. Plugging in the above values of δ and w, the number of Markov chain steps needed to generate each independent sample with the required TV error O(δ) is T = O((m^2 d^3 + m^2 d L^2 R^2) · [LR + d log((Rd + LRd)/(rε))]). Since the number of independent samples required as input by Algorithm 1 is O(1) w.h.p., the number of arithmetic operations for Algorithm 1 to output a point with at most ε infinity-distance error is O(T m d^{ω−1}). Finally, we note that in the more general setting where K is a convex body with a membership oracle (but not necessarily a polytope), we can instead use, for instance, the hit-and-run Markov chain of [29] to generate samples from π with TV error O(δ) in a number of membership- and function-evaluation-oracle calls that is polynomial in d and poly-logarithmic in 1/δ, R, and 1/r. We can then plug such a sample into Algorithm 1 to obtain a sample from π with infinity-distance error O(ε) in a number of oracle calls that is poly-logarithmic in 1/ε and 1/r, and polynomial in d, L, R (see Remark 2.3).
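For readers who want to see what a step of the Markov chain being plugged in looks like, here is a schematic, simplified Metropolis-filtered Dikin walk step for a polytope {x : Ax ≤ b} and target ∝ e^{−f}. It follows the standard Dikin walk template rather than the exact chain of [34] (which uses additional modifications and carefully chosen step sizes), so the step size and acceptance rule below are illustrative only.

```python
import numpy as np

def dikin_step(x, A, b, f, step=0.1, rng=None):
    """One schematic Dikin-walk step for the density proportional to exp(-f) on the
    polytope {y : A y <= b}. Illustrative; not the exact chain of [34]."""
    rng = rng or np.random.default_rng()
    d = x.shape[0]

    def hessian(y):
        s = b - A @ y                                  # slacks, positive inside the polytope
        return (A / s[:, None] ** 2).T @ A             # log-barrier Hessian: sum_i a_i a_i^T / s_i^2

    def proposal_logdensity(y, center, H):
        diff = y - center
        _, logdet = np.linalg.slogdet(H)
        return 0.5 * logdet - 0.5 * d / step**2 * (diff @ H @ diff)

    Hx = hessian(x)
    # Propose from a Gaussian shaped by the Dikin ellipsoid at x.
    z = x + (step / np.sqrt(d)) * np.linalg.solve(np.linalg.cholesky(Hx).T, rng.standard_normal(d))
    if np.any(A @ z >= b):
        return x                                       # proposal left the polytope: reject
    Hz = hessian(z)
    # Metropolis filter for the target exp(-f), correcting for the asymmetric proposal.
    log_accept = (-f(z) + f(x)
                  + proposal_logdensity(x, z, Hz)
                  - proposal_logdensity(z, x, Hx))
    return z if np.log(rng.random()) < min(0.0, log_accept) else x
```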
4 Conclusion, Limitations, and Future Work

To the best of our knowledge, this is the first work that presents an algorithm for sampling from log-concave distributions on convex bodies that comes with infinity-distance bounds and whose running time depends logarithmically on 1/ε. Towards this, the main technical contribution is Algorithm 1 (and Theorem 2.2), which achieves this improved dependence on ε by taking as input continuous samples from a convex body with TV bounds and converting them to samples with infinity-distance bounds. On the other hand, our bounds are polynomial in LR, whereas there are algorithms for sampling from log-concave distributions π ∝ e^{−f} on a convex body in total variation distance that are polylogarithmic in R and do not assume f to be Lipschitz [29]. Thus, the main open problem that remains is whether one can also obtain running-time bounds for sampling in the infinity-distance which are poly-logarithmic in R and do not require f to be Lipschitz. Our main result also has direct applications to differentially private optimization (Corollaries 2.4 and 2.5). Differential privacy is a notion which has been embraced in many technologies in societal contexts where privacy of individuals is a concern. Hence, we see our work as having potential for positive societal impact and do not foresee any potential negative societal impacts.

Acknowledgments and Disclosure of Funding

This research was supported in part by NSF CCF-2104528, CCF-1908347, and CCF-2112665 awards.

References

[1] David Applegate and Ravi Kannan. Sampling and integration of near log-concave functions. In Proceedings of the Twenty-Third Annual ACM Symposium on Theory of Computing, pages 156–163, 1991.
[2] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 464–473. IEEE, 2014.
[3] Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: the SuLQ framework. In Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 128–138, 2005.
[4] Nicolas Brosse, Alain Durmus, Éric Moulines, and Marcelo Pereyra. Sampling from a log-concave distribution with compact support with proximal Langevin Monte Carlo. In Conference on Learning Theory, pages 319–342. PMLR, 2017.
[5] Sébastien Bubeck, Ronen Eldan, and Joseph Lehec. Finite-time analysis of projected Langevin Monte Carlo. In Advances in Neural Information Processing Systems, pages 1243–1251, 2015.
[6] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. Advances in Neural Information Processing Systems, 24:2249–2257, 2011.
[7] Kamalika Chaudhuri, Anand Sarwate, and Kaushik Sinha. Near-optimal differentially private principal components. Advances in Neural Information Processing Systems, 25:989–997, 2012.
[8] Yuansi Chen, Raaz Dwivedi, Martin J. Wainwright, and Bin Yu. Vaidya walk: A sampling algorithm based on the volumetric barrier. In 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1220–1227. IEEE, 2017.
[9] Xiang Cheng and Peter Bartlett. Convergence of Langevin MCMC in KL-divergence. In Algorithmic Learning Theory, pages 186–211. PMLR, 2018.
[10] Arnak S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.
[11] Arnak S. Dalalyan and Lionel Riou-Durand. On sampling from a log-concave density using kinetic Langevin diffusions. Bernoulli, 26(3):1956–1988, 2020.
[12] Alain Durmus, Szymon Majewski, and Błażej Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. Journal of Machine Learning Research, 20(73):1–46, 2019.
[13] Alain Durmus and Éric Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. Annals of Applied Probability, 27(3):1551–1587, 2017.
[14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.
[15] Cynthia Dwork, Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Analyze Gauss: optimal bounds for privacy-preserving principal component analysis.
In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, pages 11–20, 2014.
[16] M. E. Dyer and A. M. Frieze. On the complexity of computing the volume of a polyhedron. SIAM Journal on Computing, 17(5):967–974, 1988.
[17] Alan Frieze and Ravi Kannan. Log-Sobolev inequalities and sampling from log-concave distributions. The Annals of Applied Probability, 9(1):14–26, 1999.
[18] Alan Frieze, Ravi Kannan, and Nick Polson. Sampling from log-concave distributions. The Annals of Applied Probability, pages 812–837, 1994.
[19] Arun Ganesh and Kunal Talwar. Faster differentially private samplers via Rényi divergence analysis of discretized Langevin MCMC. Advances in Neural Information Processing Systems, 33, 2020.
[20] Moritz Hardt and Kunal Talwar. On the geometry of differential privacy. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, pages 705–714, 2010.
[21] He Jia, Aditi Laddha, Yin Tat Lee, and Santosh S. Vempala. Reducing isotropy and volume to KLS: an O*(n³ψ²) volume algorithm. In STOC '21: 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 961–974. ACM, 2021.
[22] Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker-Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998.
[23] Ravi Kannan and Hariharan Narayanan. Random walks on polytopes and an affine interior point method for linear programming. Mathematics of Operations Research, 37(1):1–20, 2012.
[24] Jonathan Leake, Colin S. McSwiggen, and Nisheeth K. Vishnoi. A polynomial-time algorithm and applications for matrix sampling from Harish-Chandra-Itzykson-Zuber densities. In ACM Symposium on Theory of Computing (STOC), 2021.
[25] Yin Tat Lee and Santosh Vempala. Eldan's stochastic localization and the KLS hyperplane conjecture: An improved lower bound for expansion. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 998–1007. IEEE, 2017.
[26] Yin Tat Lee and Santosh S. Vempala. Convergence rate of Riemannian Hamiltonian Monte Carlo and faster polytope volume computation. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1115–1121, 2018.
[27] László Lovász and Miklós Simonovits. Random walks in a convex body and an improved volume algorithm. Random Structures & Algorithms, 4(4):359–412, 1993.
[28] László Lovász and Santosh Vempala. Hit-and-run is fast and fun. Preprint, Microsoft Research, 2003.
[29] László Lovász and Santosh Vempala. Fast algorithms for logconcave functions: Sampling, rounding, integration and optimization. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS '06), pages 57–68. IEEE, 2006.
[30] László Lovász and Santosh Vempala. The geometry of logconcave functions and sampling algorithms. Random Structures & Algorithms, 30(3):307–358, 2007.
[31] Oren Mangoubi and Aaron Smith. Mixing of Hamiltonian Monte Carlo on strongly log-concave distributions: Continuous dynamics. The Annals of Applied Probability, in press.
[32] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS '07), pages 94–103. IEEE, 2007.
[33] Hariharan Narayanan. Randomized interior point methods for sampling and optimization. The Annals of Applied Probability, 26(1):597–641, 2016.
[34] Hariharan Narayanan and Alexander Rakhlin. Efficient sampling from time-varying log-concave distributions. The Journal of Machine Learning Research, 18(1):4017–4045, 2017.
[35] Sushant Sachdeva and Nisheeth K. Vishnoi. The mixing time of the Dikin walk in a polytope - a simple proof. Operations Research Letters, 44(5):630–634, 2016.
[36] Santosh Vempala and Andre Wibisono. Rapid convergence of the unadjusted Langevin algorithm: Isoperimetry suffices. In Advances in Neural Information Processing Systems, 2019.
[37] Cédric Villani. Optimal Transport: Old and New, volume 338. Springer, 2009.
[38] Nisheeth K. Vishnoi. An introduction to Hamiltonian Monte Carlo method for sampling. CoRR, abs/2108.12107, 2021.
[39] Max Welling and Yee Whye Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.
[40] Andre Wibisono. Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. In Conference on Learning Theory, pages 2093–3027. PMLR, 2018.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes]
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] (See Section 4)
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes] (See Section 3.2, Appendix B and C)
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See supplementary material.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [N/A]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]